Phase 12: Shared SuperSlab Pool - Design Document

Date: 2025-11-13
Goal: System malloc parity (90M ops/s) via mimalloc-style shared SuperSlab architecture
Expected Impact: SuperSlab count 877 → 100-200 (-70-80%), performance +650-860%


🎯 Problem Statement

Root Cause: Fixed Size Class Architecture

Current Design (Phase 11):

// SuperSlab is bound to ONE size class
struct SuperSlab {
    uint8_t size_class;  // FIXED at allocation time (0-7)
    // ... 32 slabs, all for the SAME class
};

// 8 independent SuperSlabHead structures (one per class)
SuperSlabHead g_superslab_heads[8];  // Each class manages its own pool

Problem:

  • Benchmark (100K iterations, 256B): 877 SuperSlabs allocated
  • Memory usage: 877MB (877 × 1MB SuperSlabs)
  • Metadata overhead: 877 × ~2KB headers = ~1.8MB
  • Each size class independently allocates SuperSlabs → massive churn

Why 877?:

Class 0 (8B):    ~100 SuperSlabs
Class 1 (16B):   ~120 SuperSlabs
Class 2 (32B):   ~150 SuperSlabs
Class 3 (64B):   ~180 SuperSlabs
Class 4 (128B):  ~140 SuperSlabs
Class 5 (256B):  ~187 SuperSlabs  ← Target class for benchmark
Class 6 (512B):  ~80 SuperSlabs
Class 7 (1KB):   ~20 SuperSlabs
Total:           877 SuperSlabs (measured total; per-class counts are rough estimates)

Performance Impact:

  • Massive metadata traversal overhead
  • Poor cache locality (877 scattered 1MB regions)
  • Excessive TLB pressure
  • SuperSlab allocation churn dominates runtime

🚀 Solution: Shared SuperSlab Pool (mimalloc-style)

Core Concept

New Design (Phase 12):

// SuperSlab is NOT bound to any class - slabs are dynamically assigned
struct SuperSlab {
    // NO size_class field! Each slab has its own class_idx
    uint8_t active_slabs;       // Number of active slabs (any class)
    uint32_t slab_bitmap;       // 32-bit bitmap (1=active, 0=free)
    // ... 32 slabs, EACH can be a different size class
};

// Single global pool (shared by all classes)
typedef struct SharedSuperSlabPool {
    SuperSlab** slabs;          // Array of all SuperSlabs
    uint32_t total_count;       // Total SuperSlabs allocated
    uint32_t active_count;      // SuperSlabs with active slabs
    pthread_mutex_t lock;       // Allocation lock

    // Per-class hints (fast path optimization)
    SuperSlab* class_hints[8];  // Last known SuperSlab with free space per class
} SharedSuperSlabPool;

Per-Slab Dynamic Class Assignment

Old (TinySlabMeta):

// Slab metadata (16 bytes) - class_idx inherited from SuperSlab
typedef struct TinySlabMeta {
    void*    freelist;
    uint16_t used;
    uint16_t capacity;
    uint16_t carved;
    uint16_t owner_tid;
} TinySlabMeta;

New (Phase 12):

// Slab metadata (16 bytes) - class_idx is PER-SLAB
typedef struct TinySlabMeta {
    void*    freelist;
    uint16_t used;
    uint16_t capacity;
    uint16_t carved;
    uint8_t  class_idx;     // NEW: Dynamic class assignment (0-7, 255=unassigned)
    uint8_t  owner_tid_low; // Truncated to 8-bit (from 16-bit)
} TinySlabMeta;

Size preserved: Still 16 bytes (no growth!)
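The 16-byte claim can be verified at compile time. A minimal sketch (field names taken from the struct above; the size holds on 64-bit targets, where `void*` is 8 bytes):

```c
#include <stdint.h>

/* Phase 12 slab metadata, exactly as proposed above. */
typedef struct TinySlabMeta {
    void*    freelist;      /* 8 bytes on 64-bit targets */
    uint16_t used;
    uint16_t capacity;
    uint16_t carved;
    uint8_t  class_idx;     /* 0-7 for active, 255 = unassigned */
    uint8_t  owner_tid_low; /* truncated from 16-bit owner_tid */
} TinySlabMeta;

/* Fails the build if the layout ever grows past 16 bytes. */
_Static_assert(sizeof(TinySlabMeta) == 16,
               "TinySlabMeta must stay 16 bytes");
```

Placing the `_Static_assert` next to the struct definition keeps the "no growth" invariant enforced automatically through later refactors.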


📐 Architecture Changes

1. SuperSlab Structure (superslab_types.h)

Remove:

uint8_t size_class;  // DELETE - no longer per-SuperSlab

Add (optional, for debugging):

uint8_t mixed_slab_count;  // Number of slabs with different class_idx (stats)

2. TinySlabMeta Structure (superslab_types.h)

Modify:

typedef struct TinySlabMeta {
    void*    freelist;
    uint16_t used;
    uint16_t capacity;
    uint16_t carved;
    uint8_t  class_idx;     // NEW: 0-7 for active, 255=unassigned
    uint8_t  owner_tid_low; // Changed from uint16_t owner_tid
} TinySlabMeta;

3. Shared Pool Structure (NEW: hakmem_shared_pool.h)

// Global shared pool (singleton)
typedef struct SharedSuperSlabPool {
    SuperSlab** slabs;          // Dynamic array of SuperSlab pointers
    uint32_t capacity;          // Array capacity (grows as needed)
    uint32_t total_count;       // Total SuperSlabs allocated
    uint32_t active_count;      // SuperSlabs with >0 active slabs

    pthread_mutex_t alloc_lock; // Lock for slab allocation

    // Per-class hints (lock-free read, updated under lock)
    SuperSlab* class_hints[TINY_NUM_CLASSES];

    // LRU cache integration (Phase 9)
    SuperSlab* lru_head;
    SuperSlab* lru_tail;
    uint32_t lru_count;
} SharedSuperSlabPool;

// Global singleton
extern SharedSuperSlabPool g_shared_pool;

// API
void shared_pool_init(void);
SuperSlab* shared_pool_acquire_superslab(void);  // Get/allocate SuperSlab
int shared_pool_acquire_slab(int class_idx, SuperSlab** ss_out, int* slab_idx_out);
void shared_pool_release_slab(SuperSlab* ss, int slab_idx);

4. Allocation Flow (NEW)

Old Flow (Phase 11):

1. TLS cache miss for class C
2. Check g_superslab_heads[C].current_chunk
3. If no space → allocate NEW SuperSlab for class C
4. All 32 slabs in new SuperSlab belong to class C

New Flow (Phase 12):

1. TLS cache miss for class C
2. Check g_shared_pool.class_hints[C]
3. If hint has free slab → assign that slab to class C (set class_idx=C)
4. If no hint:
   a. Scan g_shared_pool.slabs[] for any SuperSlab with free slab
   b. If found → assign slab to class C
   c. If not found → allocate NEW SuperSlab (add to pool)
5. Update class_hints[C] for fast path

Key Benefit: NEW SuperSlab only allocated when ALL existing SuperSlabs are full!
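The five steps above can be sketched as C. This is a minimal single-threaded model, not the real implementation: the `SuperSlab` layout is simplified, `find_free_slab_idx` and the fixed-size `pool` array are stand-ins for the real bitmap scan and dynamic array, and locking is omitted:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_CLASSES 8
#define SLABS_PER_SUPERSLAB 32

typedef struct SuperSlab {
    uint32_t slab_bitmap;                      /* 1 = slab in use */
    uint8_t  class_idx[SLABS_PER_SUPERSLAB];   /* 255 = unassigned */
} SuperSlab;

static SuperSlab* pool[64];                    /* model of g_shared_pool.slabs */
static int        pool_count = 0;
static SuperSlab* class_hints[NUM_CLASSES];

/* Return the index of a free slab in ss, or -1 if all 32 are taken. */
static int find_free_slab_idx(SuperSlab* ss) {
    for (int i = 0; i < SLABS_PER_SUPERSLAB; i++)
        if (!(ss->slab_bitmap & (1u << i))) return i;
    return -1;
}

/* Steps 2-5 of the new flow: hint -> pool scan -> new SuperSlab. */
int shared_pool_acquire_slab(int cls, SuperSlab** ss_out, int* idx_out) {
    /* Steps 2-3: try the per-class hint first. */
    SuperSlab* ss = class_hints[cls];
    int idx = ss ? find_free_slab_idx(ss) : -1;

    /* Steps 4a-b: scan the whole pool for ANY SuperSlab with space. */
    if (idx < 0) {
        ss = NULL;
        for (int i = 0; i < pool_count && idx < 0; i++) {
            idx = find_free_slab_idx(pool[i]);
            if (idx >= 0) ss = pool[i];
        }
    }

    /* Step 4c: every SuperSlab is full -> allocate a new one. */
    if (idx < 0) {
        if (pool_count == 64) return -1;       /* model capacity limit */
        ss = calloc(1, sizeof(SuperSlab));
        if (!ss) return -1;
        for (int i = 0; i < SLABS_PER_SUPERSLAB; i++)
            ss->class_idx[i] = 255;
        pool[pool_count++] = ss;
        idx = 0;
    }

    /* Assign the slab to cls and refresh the hint (step 5). */
    ss->slab_bitmap |= (1u << idx);
    ss->class_idx[idx] = (uint8_t)cls;
    class_hints[cls] = ss;
    *ss_out = ss;
    *idx_out = idx;
    return 0;
}
```

In this model, 32 acquisitions for any mix of classes fill a single SuperSlab before a second one is created, which is exactly the sharing behavior that collapses 877 SuperSlabs down to the expected 100-200.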


🔧 Implementation Plan

Phase 12-1: Dynamic Slab Metadata (Current Task)

Files to modify:

  • core/superslab/superslab_types.h - Add class_idx to TinySlabMeta
  • core/superslab/superslab_types.h - Remove size_class from SuperSlab

Changes:

// TinySlabMeta: Add class_idx field
typedef struct TinySlabMeta {
    void*    freelist;
    uint16_t used;
    uint16_t capacity;
    uint16_t carved;
    uint8_t  class_idx;      // NEW: 0-7 for active, 255=UNASSIGNED
    uint8_t  owner_tid_low;  // Changed from uint16_t
} TinySlabMeta;

// SuperSlab: Remove size_class
typedef struct SuperSlab {
    uint64_t magic;
    // uint8_t size_class;   // REMOVED!
    uint8_t active_slabs;
    uint8_t lg_size;
    uint8_t _pad0;
    // ... rest unchanged
} SuperSlab;

Compatibility shim (temporary, for gradual migration):

// Provide backward-compatible size_class accessor
static inline int superslab_get_class(SuperSlab* ss, int slab_idx) {
    return ss->slabs[slab_idx].class_idx;
}

Phase 12-2: Shared Pool Infrastructure

New file: core/hakmem_shared_pool.h, core/hakmem_shared_pool.c

Functionality:

  • shared_pool_init() - Initialize global pool
  • shared_pool_acquire_slab() - Get free slab for class_idx
  • shared_pool_release_slab() - Mark slab as free (class_idx=255)
  • shared_pool_gc() - Garbage collect empty SuperSlabs

Data structure:

// Global pool (singleton)
SharedSuperSlabPool g_shared_pool = {
    .slabs = NULL,
    .capacity = 0,
    .total_count = 0,
    .active_count = 0,
    .alloc_lock = PTHREAD_MUTEX_INITIALIZER,
    .class_hints = {NULL},
    .lru_head = NULL,
    .lru_tail = NULL,
    .lru_count = 0
};

Phase 12-3: Refill Path Integration

Files to modify:

  • core/hakmem_tiny_refill_p0.inc.h - Update to use shared pool
  • core/tiny_superslab_alloc.inc.h - Replace per-class allocation with shared pool

Key changes:

// OLD: superslab_refill(int class_idx)
static SuperSlab* superslab_refill_old(int class_idx) {
    SuperSlabHead* head = &g_superslab_heads[class_idx];
    // ... allocate SuperSlab for class_idx only
}

// NEW: superslab_refill(int class_idx) - use shared pool
static SuperSlab* superslab_refill_new(int class_idx) {
    SuperSlab* ss = NULL;
    int slab_idx = -1;

    // Try to acquire a free slab from shared pool
    if (shared_pool_acquire_slab(class_idx, &ss, &slab_idx) == 0) {
        // SUCCESS: Got a slab assigned to class_idx
        return ss;
    }

    // FAILURE: shared_pool_acquire_slab() already allocates a new
    // SuperSlab internally when every existing one is full (flow step
    // 4c), so reaching here means that allocation itself failed (OOM).
    // This should be RARE after the pool grows to steady-state.
    return NULL;
}

Phase 12-4: Free Path Integration

Files to modify:

  • core/tiny_free_fast.inc.h - Update to handle dynamic class_idx
  • core/tiny_superslab_free.inc.h - Update to release slabs back to pool

Key changes:

// OLD: Free assumes slab belongs to ss->size_class
static inline void hak_tiny_free_superslab_old(void* ptr, SuperSlab* ss) {
    int class_idx = ss->size_class;  // FIXED class
    // ... free logic
}

// NEW: Free reads class_idx from slab metadata
static inline void hak_tiny_free_superslab_new(void* ptr, SuperSlab* ss, int slab_idx) {
    int class_idx = ss->slabs[slab_idx].class_idx;  // DYNAMIC class

    // ... free logic

    // If slab becomes empty, release back to pool
    if (ss->slabs[slab_idx].used == 0) {
        shared_pool_release_slab(ss, slab_idx);
        ss->slabs[slab_idx].class_idx = 255;  // Mark as unassigned
    }
}
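`shared_pool_release_slab()` is the counterpart on this path. A minimal sketch of what it would do, with the structs reduced to the fields release touches (field names follow the pool design above; the real version would also update hints and pool counters under the lock):

```c
#include <stdint.h>

#define SLABS_PER_SUPERSLAB 32
#define CLASS_UNASSIGNED    255

typedef struct SuperSlab {
    uint32_t slab_bitmap;                    /* 1 = active */
    uint8_t  active_slabs;
    uint8_t  class_idx[SLABS_PER_SUPERSLAB]; /* per-slab class */
} SuperSlab;

/* Return an empty slab to the pool: clear its bitmap bit, mark it
 * unassigned, and track how many slabs in this SuperSlab remain. */
void shared_pool_release_slab(SuperSlab* ss, int slab_idx) {
    ss->slab_bitmap &= ~(1u << slab_idx);
    ss->class_idx[slab_idx] = CLASS_UNASSIGNED;
    ss->active_slabs--;
    /* Once active_slabs reaches 0, the whole SuperSlab becomes a
     * candidate for shared_pool_gc(). */
}
```

Because release only marks the slab unassigned, the next `shared_pool_acquire_slab()` can hand the same slab to a *different* class, which is the core of the sharing design.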

Phase 12-5: Testing & Benchmarking

Validation:

  1. Correctness: Run bench_fixed_size_hakmem 100K iterations (all classes)
  2. SuperSlab count: Monitor g_shared_pool.total_count (expect 100-200)
  3. Performance: bench_random_mixed_hakmem (expect 70-90M ops/s)

Expected results:

| Metric            | Phase 11 (Before) | Phase 12 (After) | Improvement |
|-------------------|-------------------|------------------|-------------|
| SuperSlab count   | 877               | 100-200          | -70-80%     |
| Memory usage      | 877MB             | 100-200MB        | -70-80%     |
| Metadata overhead | ~1.8MB            | ~0.2-0.4MB       | -78-89%     |
| Performance       | 9.38M ops/s       | 70-90M ops/s     | +650-860%   |

⚠️ Risk Analysis

Complexity Risks

  1. Concurrency: Shared pool requires careful locking

    • Mitigation: Per-class hints reduce contention (lock-free fast path)
  2. Fragmentation: Mixed classes in same SuperSlab may increase fragmentation

    • Mitigation: Smart slab assignment (prefer same-class SuperSlabs)
  3. Debugging: Dynamic class_idx makes debugging harder

    • Mitigation: Add runtime validation (class_idx sanity checks)

Performance Risks

  1. Lock contention: Shared pool lock may become bottleneck

    • Mitigation: Per-class hints + fast path bypass lock 90%+ of time
  2. Cache misses: Accessing distant SuperSlabs may reduce locality

    • Mitigation: LRU cache keeps hot SuperSlabs resident
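The "lock-free read, updated under lock" hint scheme used by both mitigations above could be expressed with C11 atomics. A sketch under stated assumptions: `hint_load`/`hint_store` are illustrative names, `SuperSlab` is left incomplete, and 8 matches TINY_NUM_CLASSES:

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct SuperSlab SuperSlab;

/* Hints are written only while alloc_lock is held, but readers may
 * load them without taking the lock. */
static _Atomic(SuperSlab*) class_hints[8];

/* Fast path: racy-but-safe read. A stale hint costs at most one
 * extra pool scan; it can never corrupt pool state. */
SuperSlab* hint_load(int cls) {
    return atomic_load_explicit(&class_hints[cls], memory_order_acquire);
}

/* Slow path: called with alloc_lock held after a successful scan,
 * so concurrent writers never race each other. */
void hint_store(int cls, SuperSlab* ss) {
    atomic_store_explicit(&class_hints[cls], ss, memory_order_release);
}
```

Acquire/release pairing ensures a thread that reads a hint also sees the SuperSlab contents published before the hint was stored.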

📊 Success Metrics

Primary Goals

  1. SuperSlab count: 877 → 100-200 (-70-80%)
  2. Performance: 9.38M → 70-90M ops/s (+650-860%)
  3. Memory usage: 877MB → 100-200MB (-70-80%)

Stretch Goals

  1. System malloc parity: 90M ops/s (100% of target) 🎯
  2. Scalability: Maintain performance with 4T+ threads
  3. Fragmentation: <10% internal fragmentation

🔄 Migration Strategy

Phase 12-1: Metadata (Low Risk)

  • Add class_idx to TinySlabMeta (16B preserved)
  • Remove size_class from SuperSlab
  • Add backward-compatible shim

Phase 12-2: Infrastructure (Medium Risk)

  • Implement shared pool (NEW code, isolated)
  • No changes to existing paths yet

Phase 12-3: Integration (High Risk)

  • Update refill path to use shared pool
  • Update free path to handle dynamic class_idx
  • Critical: Extensive testing required

Phase 12-4: Cleanup (Low Risk)

  • Remove per-class SuperSlabHead structures
  • Remove backward-compatible shims
  • Final optimization pass

📝 Next Steps

Immediate (Phase 12-1)

  1. Update superslab_types.h - Add class_idx to TinySlabMeta
  2. Update superslab_types.h - Remove size_class from SuperSlab
  3. Add backward-compatible shim superslab_get_class()
  4. Fix compilation errors (grep for ss->size_class)

Next (Phase 12-2)

  1. Implement hakmem_shared_pool.h/c
  2. Write unit tests for shared pool
  3. Integrate with LRU cache (Phase 9)

Then (Phase 12-3+)

  1. Update refill path
  2. Update free path
  3. Benchmark & validate
  4. Cleanup & optimize

Status: 🚧 Phase 12-1 (Metadata) - IN PROGRESS
Expected completion: Phase 12-1 today, Phase 12-2 tomorrow, Phase 12-3 the day after
Total estimated time: 3-4 days for full implementation