2025-11-26 13:14:18 +09:00

# HAKMEM Tiny Allocator Feature Audit & Removal List

## Methodology

This audit identifies features in `tiny_alloc_fast()` that should be removed based on:
1. **Performance impact**: A/B tests showing regression
2. **Redundancy**: Overlapping functionality with better alternatives
3. **Complexity**: High maintenance cost vs benefit
4. **Usage**: Disabled by default, never enabled in production

---

## Features to REMOVE (Immediate)

### 1. UltraHot (Phase 14) - **DELETE**

**Location**: `tiny_alloc_fast.inc.h:669-686`

**Code**:
```c
if (__builtin_expect(ultra_hot_enabled() && front_prune_ultrahot_enabled(), 0)) {
    void* base = ultra_hot_alloc(size);
    if (base) {
        front_metrics_ultrahot_hit(class_idx);
        HAK_RET_ALLOC(class_idx, base);
    }
    // Miss → refill from TLS SLL
    if (class_idx >= 2 && class_idx <= 5) {
        front_metrics_ultrahot_miss(class_idx);
        ultra_hot_try_refill(class_idx);
        base = ultra_hot_alloc(size);
        if (base) {
            front_metrics_ultrahot_hit(class_idx);
            HAK_RET_ALLOC(class_idx, base);
        }
    }
}
```

**Evidence for removal**:
- **Default**: OFF (the branch is hinted as unlikely via `__builtin_expect(..., 0)`)
- **ENV flag**: `HAKMEM_TINY_FRONT_ENABLE_ULTRAHOT=1` (default: OFF)
- **Comment from code**: "A/B Test Result: UltraHot adds branch overhead (11.7% hit) → HeapV2-only is faster"
- **Performance impact**: Phase 19-4 showed +12.9% when UltraHot was DISABLED

**Why it exists**: Phase 14 experiment to create an ultra-fast C2-C5 magazine

**Why it failed**: Branch overhead outweighs the benefit of the magazine's 11.7% hit rate

**Removal impact**:
- **Assembly reduction**: ~100-150 lines
- **Performance gain**: +10-15% (Phase 19-4 measured +12.9%)
- **Risk**: NONE (already disabled, proven harmful)

**Files to delete**:
- `core/front/tiny_ultra_hot.h` (147 lines)
- `core/front/tiny_ultra_hot.c` (if it exists)
- Remove references at `tiny_alloc_fast.inc.h:34,669-686`

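A rough expected-cost model shows why an 11.7% hit rate struggles to pay for an extra check on every call. The sketch below is illustrative only: `front_layer_expected_cost` and `front_layer_pays_off` are hypothetical helpers, and any cycle counts passed to them are assumptions, not HAKMEM measurements.

```c
#include <assert.h>

/* Expected per-allocation cost (in cycles) of placing an optional front
 * layer ahead of an existing fallback path.
 *   check_cost    : cost of the enable/lookup check, paid on every call
 *   hit_rate      : fraction of calls served by the front layer (0..1)
 *   hit_cost      : cost of serving a hit from the front layer
 *   fallback_cost : cost of the path taken when the front layer misses
 * All inputs are caller-supplied assumptions, not HAKMEM measurements. */
static double front_layer_expected_cost(double check_cost, double hit_rate,
                                        double hit_cost, double fallback_cost)
{
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return check_cost
         + hit_rate * hit_cost
         + (1.0 - hit_rate) * fallback_cost;
}

/* The layer only pays off if its expected cost beats the fallback alone. */
static int front_layer_pays_off(double check_cost, double hit_rate,
                                double hit_cost, double fallback_cost)
{
    return front_layer_expected_cost(check_cost, hit_rate,
                                     hit_cost, fallback_cost) < fallback_cost;
}
```

With an assumed 2-cycle check and a 10-cycle fallback, an 11.7% hit rate gives 2 + 0.883 × 10 = 10.83 cycles even if hits were free, which is worse than the 10-cycle fallback alone. In this model the layer breaks even only when `check_cost < hit_rate * (fallback_cost - hit_cost)`.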
---

### 2. HeapV2 (Phase 13-A) - **DELETE**

**Location**: `tiny_alloc_fast.inc.h:693-701`

**Code**:
```c
if (__builtin_expect(tiny_heap_v2_enabled() && front_prune_heapv2_enabled(), 0) && class_idx <= 3) {
    void* base = tiny_heap_v2_alloc_by_class(class_idx);
    if (base) {
        front_metrics_heapv2_hit(class_idx);
        HAK_RET_ALLOC(class_idx, base);
    } else {
        front_metrics_heapv2_miss(class_idx);
    }
}
```

**Evidence for removal**:
- **Default**: OFF (the branch is hinted as unlikely via `__builtin_expect(..., 0)`)
- **ENV flag**: `HAKMEM_TINY_HEAP_V2=1` + `HAKMEM_TINY_FRONT_DISABLE_HEAPV2=0` (both required)
- **Redundancy**: Overlaps with Ring Cache (Phase 21-1), which performs better
- **Target**: C0-C3 only (overlaps Ring Cache on C2/C3)

**Why it exists**: Phase 13 experiment for a per-thread magazine

**Why it's redundant**: Ring Cache (Phase 21-1) achieves a +15-20% improvement, while HeapV2 never showed positive results

**Removal impact**:
- **Assembly reduction**: ~80-120 lines
- **Performance gain**: +5-10% (branch removal)
- **Risk**: LOW (disabled by default, Ring Cache is superior)

**Files to delete**:
- `core/front/tiny_heap_v2.h` (200+ lines)
- Remove references at `tiny_alloc_fast.inc.h:33,693-701`

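The two-flag requirement can be pictured as a single gate that is open for only one of the possible flag combinations. The sketch below is an assumption about how such a gate might be evaluated, not the actual HAKMEM code: `env_value_is_on`, `heapv2_gate`, and `heapv2_gate_from_env` are hypothetical helpers, the truthiness rule is borrowed from the Front-Direct snippet in this audit, and treating an unset disable flag as "disabled" is a guess based on the "both required" note.

```c
#include <assert.h>
#include <stdlib.h>

/* Truthiness rule borrowed from the Front-Direct snippet in this audit:
 * a flag is on when it is set, non-empty, and does not start with '0'. */
static int env_value_is_on(const char* v)
{
    return (v && *v && *v != '0') ? 1 : 0;
}

/* Hypothetical combined gate: HeapV2 runs only when the feature flag is on
 * AND the disable flag is explicitly set to a value starting with '0'.
 * Treating an unset disable flag as "disabled" is an assumption. */
static int heapv2_gate(const char* enable_value, const char* disable_value)
{
    return env_value_is_on(enable_value)
        && disable_value != NULL
        && *disable_value == '0';
}

/* Reads both variables from the environment; a caller would cache the result. */
static int heapv2_gate_from_env(void)
{
    return heapv2_gate(getenv("HAKMEM_TINY_HEAP_V2"),
                       getenv("HAKMEM_TINY_FRONT_DISABLE_HEAPV2"));
}
```

Every such gate multiplies the number of configurations that must be tested, which is part of the maintenance cost that deleting the feature removes.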
---

### 3. Front C23 (Phase B) - **DELETE**

**Location**: `tiny_alloc_fast.inc.h:610-617`

**Code**:
```c
if (tiny_front_c23_enabled() && (class_idx == 2 || class_idx == 3)) {
    void* c23_ptr = tiny_front_c23_alloc(size, class_idx);
    if (c23_ptr) {
        HAK_RET_ALLOC(class_idx, c23_ptr);
    }
    // Fall through to existing path if C23 path failed (NULL)
}
```

**Evidence for removal**:
- **ENV flag**: `HAKMEM_TINY_FRONT_C23_SIMPLE=1` (opt-in)
- **Redundancy**: Overlaps with Ring Cache (C2/C3), which is superior
- **Target**: 128B/256B (same as Ring Cache)
- **Result**: Never showed improvement over Ring Cache

**Why it exists**: Phase B experiment for an ultra-simple C2/C3 frontend

**Why it's redundant**: Ring Cache (Phase 21-1) is simpler and faster (+15-20% measured)

**Removal impact**:
- **Assembly reduction**: ~60-80 lines
- **Performance gain**: +3-5% (branch removal)
- **Risk**: NONE (Ring Cache is strictly better)

**Files to delete**:
- `core/front/tiny_front_c23.h` (100+ lines)
- Remove references at `tiny_alloc_fast.inc.h:30,610-617`

---

### 4. FastCache (C0-C3 array stack) - **CONSOLIDATE into SFC**

**Location**: `tiny_alloc_fast.inc.h:232-244`

**Code**:
```c
if (__builtin_expect(g_fastcache_enable && class_idx <= 3, 1)) {
    void* fc = fastcache_pop(class_idx);
    if (__builtin_expect(fc != NULL, 1)) {
        extern unsigned long long g_front_fc_hit[];
        g_front_fc_hit[class_idx]++;
        return fc;
    } else {
        extern unsigned long long g_front_fc_miss[];
        g_front_fc_miss[class_idx]++;
    }
}
```

**Evidence for consolidation**:
- **Overlap**: FastCache (C0-C3) and SFC (all classes) are both array stacks
- **Redundancy**: SFC is more general (supports all classes C0-C7)
- **Performance**: SFC showed better results in Phase 5-NEW

**Why both exist**: Historical accumulation (FastCache was first, SFC came later)

**Why consolidate**: One unified array cache is simpler and faster than two

**Consolidation plan**:
1. Keep SFC (more general)
2. Remove FastCache-specific code
3. Configure SFC for all classes C0-C7

**Removal impact**:
- **Assembly reduction**: ~80-100 lines
- **Performance gain**: +5-8% (one less branch check)
- **Risk**: LOW (SFC is proven, just extend capacity for C0-C3)

**Files to modify**:
- Delete `core/hakmem_tiny_fastcache.inc.h` (8KB)
- Keep `core/tiny_alloc_fast_sfc.inc.h` (8.6KB)
- Remove references at `tiny_alloc_fast.inc.h:19,232-244`

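The consolidation target, one array stack per class for C0-C7, can be sketched as follows. This is a minimal illustration rather than the SFC implementation: `TinyArrayCache`, `tac_push`, `tac_pop`, and the capacity of 32 are hypothetical, and the thread-local qualifier a real allocator would use is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define TAC_NUM_CLASSES 8   /* C0-C7, as in the audit */
#define TAC_CAPACITY    32  /* per-class slots; an arbitrary illustrative value */

/* One LIFO array stack per size class. In a real allocator this would be
 * thread-local; that qualifier is left out to keep the sketch portable. */
typedef struct {
    void*    slots[TAC_NUM_CLASSES][TAC_CAPACITY];
    unsigned count[TAC_NUM_CLASSES];
} TinyArrayCache;

/* Returns 1 if the block was cached, 0 if the class is full (the caller
 * would then spill to the next layer). */
static int tac_push(TinyArrayCache* c, unsigned class_idx, void* block)
{
    assert(class_idx < TAC_NUM_CLASSES);
    if (c->count[class_idx] == TAC_CAPACITY) return 0;
    c->slots[class_idx][c->count[class_idx]++] = block;
    return 1;
}

/* Returns the most recently pushed block, or NULL on a miss. */
static void* tac_pop(TinyArrayCache* c, unsigned class_idx)
{
    assert(class_idx < TAC_NUM_CLASSES);
    if (c->count[class_idx] == 0) return NULL;
    return c->slots[class_idx][--c->count[class_idx]];
}
```

Because the class index selects the stack, the fast path needs no separate `class_idx <= 3` branch of the kind FastCache adds in front of SFC.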
---

### 5. Class5 Hotpath (256B dedicated path) - **MERGE into main path**

**Location**: `tiny_alloc_fast.inc.h:710-732`

**Code**:
```c
if (__builtin_expect(hot_c5, 0)) {
    // class5: dedicated shortest path (generic front bypassed entirely)
    void* p = tiny_class5_minirefill_take();
    if (p) {
        front_metrics_class5_hit(class_idx);
        HAK_RET_ALLOC(class_idx, p);
    }
    // ... refill + retry logic (20 lines)
    // slow path (bypass generic front)
    ptr = hak_tiny_alloc_slow(size, class_idx);
    if (ptr) HAK_RET_ALLOC(class_idx, ptr);
    return ptr;
}
```

**Evidence for removal**:
- **ENV flag**: `HAKMEM_TINY_HOTPATH_CLASS5=0` (default: OFF)
- **Special case**: Only benefits 256B allocations
- **Complexity**: 25+ lines of duplicate refill logic
- **Benefit**: Minimal (bypasses generic front, but Ring Cache handles C5 well)

**Why it exists**: Attempt to optimize 256B (a common size)

**Why to remove**: Ring Cache already optimizes C2/C3/C5, so no special case is needed

**Removal impact**:
- **Assembly reduction**: ~120-150 lines
- **Performance gain**: +2-5% (branch removal, I-cache improvement)
- **Risk**: LOW (disabled by default, Ring Cache handles C5)

**Files to modify**:
- Remove references at `tiny_alloc_fast.inc.h:100-112,710-732`
- Remove `g_tiny_hotpath_class5` from `hakmem_tiny.c:120`

---

### 6. Front-Direct Mode (experimental bypass) - **SIMPLIFY**

**Location**: `tiny_alloc_fast.inc.h:704-708,759-775`

**Code**:
```c
static __thread int s_front_direct_alloc = -1;
if (__builtin_expect(s_front_direct_alloc == -1, 0)) {
    const char* e = getenv("HAKMEM_TINY_FRONT_DIRECT");
    s_front_direct_alloc = (e && *e && *e != '0') ? 1 : 0;
}

if (s_front_direct_alloc) {
    // Front-Direct: Direct SS→FC refill (bypasses SLL/TLS List)
    int refilled_fc = tiny_alloc_fast_refill(class_idx);
    if (__builtin_expect(refilled_fc > 0, 1)) {
        void* fc_ptr = fastcache_pop(class_idx);
        if (fc_ptr) HAK_RET_ALLOC(class_idx, fc_ptr);
    }
} else {
    // Legacy: Refill to TLS List/SLL
    extern __thread TinyTLSList g_tls_lists[TINY_NUM_CLASSES];
    void* took = tiny_fast_refill_and_take(class_idx, &g_tls_lists[class_idx]);
    if (took) HAK_RET_ALLOC(class_idx, took);
}
```

**Evidence for simplification**:
- **Dual paths**: Front-Direct vs Legacy (mutually exclusive)
- **Complexity**: TLS caching of ENV flag + two refill paths
- **Benefit**: Unclear (no documented A/B test results)

**Why to simplify**: Pick ONE refill strategy and remove the toggle

**Simplification plan**:
1. A/B test Front-Direct vs Legacy
2. Keep winner, delete loser
3. Remove ENV toggle

**Removal impact** (after A/B):
- **Assembly reduction**: ~100-150 lines
- **Performance gain**: +5-10% (one less branch + simpler refill)
- **Risk**: MEDIUM (need A/B test to pick winner)

**Action**: A/B test required before removal

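One way to make "keep winner, delete loser" mechanical is to compare the medians of repeated runs and refuse to decide when the gap is within noise. The sketch below is an assumed decision rule, not a documented HAKMEM procedure: `ab_pick_winner`, `median_inplace`, and the `min_gain` threshold are hypothetical, and the choice of median over mean is only meant to blunt outlier runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void* a, const void* b)
{
    double x = *(const double*)a, y = *(const double*)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts the caller's array in place. n must be > 0. */
static double median_inplace(double* samples, size_t n)
{
    assert(n > 0);
    qsort(samples, n, sizeof samples[0], cmp_double);
    return (n % 2) ? samples[n / 2]
                   : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

/* Compares two arms of an A/B run (throughput in ops/s, higher is better).
 * Returns +1 if arm A wins, -1 if arm B wins, and 0 if the gap between the
 * medians is within min_gain (e.g. 0.02 = 2%), i.e. too small to act on. */
static int ab_pick_winner(double* arm_a, size_t n_a,
                          double* arm_b, size_t n_b, double min_gain)
{
    double ma = median_inplace(arm_a, n_a);
    double mb = median_inplace(arm_b, n_b);
    if (ma > mb * (1.0 + min_gain)) return +1;
    if (mb > ma * (1.0 + min_gain)) return -1;
    return 0;
}
```

A result of 0 would mean neither path has earned its place on performance alone, in which case the simpler refill path is the natural one to keep.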
---

## Features to KEEP (Proven performers)

### 1. Unified Cache (Phase 23) - **KEEP & PROMOTE**

**Location**: `tiny_alloc_fast.inc.h:623-635`

**Evidence for keeping**:
- **Target**: All classes C0-C7 (comprehensive)
- **Design**: Single-layer tcache (simple)
- **Performance**: +20-30% improvement documented (Phase 23-E)
- **ENV flag**: `HAKMEM_TINY_UNIFIED_CACHE=1`

**Recommendation**: **Make this the PRIMARY frontend** (Layer 0)

---

### 2. Ring Cache (Phase 21-1) - **KEEP as fallback OR MERGE into Unified**

**Location**: `tiny_alloc_fast.inc.h:641-659`

**Evidence for keeping**:
- **Target**: C2/C3 (hot classes)
- **Performance**: +15-20% improvement (54.4M → 62-65M ops/s)
- **Design**: Array-based TLS cache (no pointer chasing)
- **ENV flag**: `HAKMEM_TINY_HOT_RING_ENABLE=1` (default: ON)

**Decision needed**: Ring Cache vs Unified Cache (both are array-based)
- Option A: Keep Ring Cache only (C2/C3 specialized)
- Option B: Keep Unified Cache only (all classes)
- Option C: Keep both (redundant?)

**Recommendation**: **A/B test Ring vs Unified**, keep winner only

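To make "array-based, no pointer chasing" concrete, here is a minimal ring sketch for a single size class. It is not the HAKMEM Ring Cache layout: `HotRing`, `ring_push`, `ring_pop`, the capacity of 16, and the FIFO ordering are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define RING_CAP 16u  /* must be a power of two so wrap-around is a mask */

/* Fixed-size ring of block pointers for one size class. head and tail are
 * free-running counters; (tail - head) is the number of cached blocks. */
typedef struct {
    void*    slots[RING_CAP];
    unsigned head;  /* next slot to pop  */
    unsigned tail;  /* next slot to push */
} HotRing;

/* Returns 1 on success, 0 when the ring is full. */
static int ring_push(HotRing* r, void* block)
{
    if (r->tail - r->head == RING_CAP) return 0;
    r->slots[r->tail++ & (RING_CAP - 1u)] = block;
    return 1;
}

/* Returns the oldest cached block, or NULL when the ring is empty. */
static void* ring_pop(HotRing* r)
{
    if (r->tail == r->head) return NULL;
    return r->slots[r->head++ & (RING_CAP - 1u)];
}
```

Both operations touch only the ring's own array and two counters, so a hit never dereferences a freed block, which is the property that distinguishes this design from a linked freelist.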
---

### 3. TLS SLL (mimalloc-inspired freelist) - **KEEP**

**Location**: `tiny_alloc_fast.inc.h:278-305,736-752`

**Evidence for keeping**:
- **Purpose**: Unlimited overflow when Layer 0 cache is full
- **Performance**: Critical for variable working sets
- **Simplicity**: Minimal overhead (3-4 instructions)

**Recommendation**: **Keep as Layer 1** (overflow from Layer 0)

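The reason a singly linked freelist can act as an unbounded overflow layer is that the links live inside the free blocks themselves. The following is a generic intrusive-freelist sketch, not the HAKMEM TLS SLL code; `FreeList`, `sll_push`, and `sll_pop` are hypothetical names, and the thread-local storage a real implementation would use is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Intrusive singly linked freelist: the first word of each free block stores
 * the pointer to the next free block, so the list needs no side storage and
 * has no fixed capacity. Blocks must be at least pointer-sized and
 * pointer-aligned. */
typedef struct {
    void*    head;
    unsigned count;
} FreeList;

static void sll_push(FreeList* l, void* block)
{
    *(void**)block = l->head;  /* link block in front of the current head */
    l->head = block;
    l->count++;
}

/* Pop loads head, loads head->next, and stores the new head: a short
 * sequence with a single emptiness branch. Returns NULL when empty. */
static void* sll_pop(FreeList* l)
{
    void* block = l->head;
    if (block == NULL) return NULL;
    l->head = *(void**)block;
    l->count--;
    return block;
}
```

The trade-off against an array cache is that each pop reads from the block being handed out, so a cold block costs a cache miss; that is one reason to keep the list as Layer 1 behind an array-based Layer 0.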
---

### 4. SuperSlab Backend - **KEEP**

**Location**: `hakmem_tiny.c` + `tiny_superslab_*.inc.h`

**Evidence for keeping**:
- **Purpose**: Memory allocation source (mmap wrapper)
- **Performance**: Essential (no alternative)

**Recommendation**: **Keep as Layer 2** (backend refill source)

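Taken together, the KEEP recommendations describe a three-layer path: array cache (Layer 0), overflow freelist (Layer 1), backend refill (Layer 2). The sketch below shows that control flow for a single size class under heavy simplification. Everything in it is hypothetical: `TinyFront`, `front_alloc`, `front_free`, and the constants are invented for illustration, `malloc()` stands in for the SuperSlab backend, and refilled chunks are never returned to the system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define L0_CAP      4    /* tiny on purpose so the overflow path is easy to hit */
#define BLOCK_SIZE  64   /* single size class for the sketch */
#define REFILL_N    8    /* blocks carved per backend refill */

typedef struct {
    void*    l0[L0_CAP];       /* Layer 0: array cache */
    unsigned l0_count;
    void*    l1_head;          /* Layer 1: intrusive overflow freelist */
    unsigned backend_refills;  /* how often Layer 2 was consulted */
} TinyFront;

/* Layer 2 stand-in: malloc() plays the role of the SuperSlab backend.
 * The chunk is carved into blocks that are threaded onto Layer 1. */
static int backend_refill(TinyFront* f)
{
    char* chunk = malloc((size_t)BLOCK_SIZE * REFILL_N);
    if (chunk == NULL) return 0;
    for (int i = 0; i < REFILL_N; i++) {
        void* block = chunk + (size_t)i * BLOCK_SIZE;
        *(void**)block = f->l1_head;
        f->l1_head = block;
    }
    f->backend_refills++;
    return 1;
}

static void* front_alloc(TinyFront* f)
{
    if (f->l0_count > 0) return f->l0[--f->l0_count];           /* Layer 0 hit */
    if (f->l1_head == NULL && !backend_refill(f)) return NULL;  /* Layer 2 */
    void* block = f->l1_head;                                   /* Layer 1 pop */
    f->l1_head = *(void**)block;
    return block;
}

static void front_free(TinyFront* f, void* block)
{
    if (f->l0_count < L0_CAP) { f->l0[f->l0_count++] = block; return; }
    *(void**)block = f->l1_head;   /* Layer 0 full: overflow to Layer 1 */
    f->l1_head = block;
}
```

The point of the sketch is the shape of the fast path: one array check, one list check, then the backend, with no per-feature enable flags between them.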
---

## Summary: Removal Priority List

### High Priority (Remove immediately):
1. ✅ **UltraHot** - Proven harmful (+12.9% when disabled)
2. ✅ **HeapV2** - Redundant with Ring Cache
3. ✅ **Front C23** - Redundant with Ring Cache
4. ✅ **Class5 Hotpath** - Special case, unnecessary

### Medium Priority (Remove after A/B test):
5. ⚠️ **FastCache** - Consolidate into SFC or Unified Cache
6. ⚠️ **Front-Direct** - A/B test, then pick one refill path

### Low Priority (Evaluate later):
7. 🔍 **SFC vs Unified Cache** - Both are array caches, pick one
8. 🔍 **Ring Cache** - Specialized (C2/C3) vs Unified (all classes)

---

## Expected Assembly Reduction

| Feature | Assembly Lines (upper estimate) | Removal Impact |
|---------|---------------------------------|----------------|
| UltraHot | ~150 | High priority |
| HeapV2 | ~120 | High priority |
| Front C23 | ~80 | High priority |
| Class5 Hotpath | ~150 | High priority |
| FastCache | ~100 | Medium priority |
| Front-Direct | ~150 | Medium priority |
| **Total** | **~750 lines** | **-70% of current bloat** |

**Current**: 2624 assembly lines
**After removal**: ~1000-1200 lines (-60%); this is more than the ~750 lines itemized above account for (2624 − 750 ≈ 1870), so it depends on further reductions not listed in the table
**After optimization**: ~150-200 lines (target)

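As a consistency check on these figures, the per-feature estimates can be summed and compared against the 2624-line baseline. The helpers below are a hypothetical sketch for that arithmetic only; the inputs come from the per-feature ranges given earlier in this audit.

```c
#include <assert.h>
#include <stddef.h>

/* Sums per-feature assembly line estimates. */
static unsigned sum_lines(const unsigned* lines, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) total += lines[i];
    return total;
}

/* Percentage of `baseline` that `removed` represents. baseline must be > 0. */
static double percent_of(unsigned removed, unsigned baseline)
{
    assert(baseline > 0);
    return 100.0 * (double)removed / (double)baseline;
}
```

Feeding in the upper estimates (150, 120, 80, 150, 100, 150) gives 750 lines, about 28.6% of 2624; the lower ends of the same ranges (100, 80, 60, 120, 80, 100) give 540 lines, about 20.6%. Reaching ~1000-1200 lines therefore requires savings beyond the six features listed.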
---

## Recommended Action Plan

**Week 1 - High Priority Removals**:
1. Delete UltraHot (4 hours)
2. Delete HeapV2 (4 hours)
3. Delete Front C23 (2 hours)
4. Delete Class5 Hotpath (2 hours)
5. **Test & benchmark** (4 hours)

**Expected result**: 23.6M → 40-50M ops/s (+70-110%)

**Week 2 - A/B Tests & Consolidation**:
6. A/B: FastCache vs SFC (1 day)
7. A/B: Front-Direct vs Legacy (1 day)
8. A/B: Ring Cache vs Unified Cache (1 day)
9. **Pick winners, remove losers** (1 day)

**Expected result**: 40-50M → 70-90M ops/s (+200-280% total)

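The percentages in this plan are relative to the 23.6M ops/s baseline, and gains from successive steps compound rather than add. The two helpers below are a hypothetical sketch of that arithmetic, useful when checking benchmark results against the plan.

```c
#include <assert.h>

/* Relative gain in percent when throughput moves from `before` to `after`
 * (both in the same unit, e.g. M ops/s). before must be > 0. */
static double gain_percent(double before, double after)
{
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}

/* Gains compound multiplicatively: two successive steps of g1% and g2%
 * give (1 + g1/100) * (1 + g2/100) - 1 overall, not g1 + g2. */
static double compound_percent(double g1, double g2)
{
    return ((1.0 + g1 / 100.0) * (1.0 + g2 / 100.0) - 1.0) * 100.0;
}
```

For the plan's numbers, `gain_percent(23.6, 40.0)` is about 69.5 and `gain_percent(23.6, 50.0)` about 111.9, matching the Week 1 range; `gain_percent(23.6, 70.0)` is about 196.6 and `gain_percent(23.6, 90.0)` about 281.4, matching the Week 2 total.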
---

## Conclusion

The audit identifies **6 features as removal candidates**:
- 4 can be removed immediately at no or low risk because they are disabled by default or opt-in and are either proven harmful or redundant (UltraHot, HeapV2, Front C23, Class5 Hotpath)
- 2 need A/B testing to pick winners before anything is removed (FastCache/SFC, Front-Direct/Legacy)

**Total cleanup potential**: ~750 assembly lines (-70% bloat) and a projected +200-280% throughput improvement (23.6M → 70-90M ops/s).

**Recommended first action**: Start with the High Priority removals (1 week), which are safe and deliver immediate gains.