# HAKMEM Tiny Allocator Refactoring Plan
## Executive Summary
**Problem**: `tiny_alloc_fast()` generates 2624 lines of assembly (should be ~20-50 lines for a fast path), causing 11.6x more L1 cache misses than System malloc (1.98 miss/op vs 0.17). Performance: 23.6M ops/s vs System's 92.6M ops/s (3.9x slower).
**Root Cause**: Architectural bloat from accumulation of experimental features:
- 26 conditional compilation branches in `tiny_alloc_fast.inc.h`
- 38 runtime conditional checks in allocation path
- 11 overlapping frontend layers (Ring Cache, Unified Cache, HeapV2, UltraHot, FastCache, SFC, etc.)
- 2228-line monolithic `hakmem_tiny.c`
- 885-line `tiny_alloc_fast.inc.h` with excessive inlining
**Impact**: The "smart features" added to improve performance instead cause instruction-cache thrashing that destroys the fast path.
---
## Analysis: Current Architecture Problems
### Problem 1: Too Many Frontend Layers (Bloat Disease)
**Current layers in `tiny_alloc_fast()`** (lines 562-812):
```c
static inline void* tiny_alloc_fast(size_t size) {
    // Layer 0: FastCache (C0-C3 only) - lines 232-244
    if (g_fastcache_enable && class_idx <= 3) { ... }
    // Layer 1: SFC (Super Front Cache) - lines 255-274
    if (sfc_is_enabled) { ... }
    // Layer 2: Front C23 (Ultra-simple C2/C3) - lines 610-617
    if (tiny_front_c23_enabled() && (class_idx == 2 || class_idx == 3)) { ... }
    // Layer 3: Unified Cache (tcache-style) - lines 623-635
    if (unified_cache_enabled()) { ... }
    // Layer 4: Ring Cache (C2/C3/C5 only) - lines 641-659
    if (class_idx == 2 || class_idx == 3) { ... }
    // Layer 5: UltraHot (C2-C5) - lines 669-686
    if (ultra_hot_enabled() && front_prune_ultrahot_enabled()) { ... }
    // Layer 6: HeapV2 (C0-C3) - lines 693-701
    if (tiny_heap_v2_enabled() && front_prune_heapv2_enabled() && class_idx <= 3) { ... }
    // Layer 7: Class5 hotpath (256B dedicated) - lines 710-732
    if (hot_c5) { ... }
    // Layer 8: TLS SLL (generic) - lines 736-752
    if (g_tls_sll_enable && !s_front_direct_alloc) { ... }
    // Layer 9: Front-Direct refill - lines 759-775
    if (s_front_direct_alloc) { ... }
    // Layer 10: Legacy refill - lines 769-775
    else { ... }
    // Layer 11: Slow path - lines 806-809
    ptr = hak_tiny_alloc_slow(size, class_idx);
}
```
**Problem**: 11 layers with overlapping responsibilities!
- **Redundancy**: Ring Cache (C2/C3), Front C23 (C2/C3), and UltraHot (C2-C5) all target the same classes
- **Branch explosion**: Each layer adds 2-5 conditional branches
- **I-cache thrashing**: a single fast path that expands to 2624 assembly lines consumes a large share of the 32KB L1 instruction cache (~10K instructions), evicting the rest of the hot code
### Problem 2: Assembly Bloat Analysis
**Expected fast path** (System malloc tcache):
```asm
; Hit path: ~6 instructions
    mov  rax, QWORD PTR [tls_cache + class*8]  ; Load head
    test rax, rax                              ; Check NULL
    je   .miss                                 ; Branch on empty
    mov  rdx, QWORD PTR [rax]                  ; Load next
    mov  QWORD PTR [tls_cache + class*8], rdx  ; Update head
    ret                                        ; Return ptr
.miss:
    call tcache_refill                         ; Refill (cold path)
```
**Actual HAKMEM fast path**: 2624 lines of assembly!
**Why?**
1. **Inlining explosion**: Every `__attribute__((always_inline))` layer inlines ALL branches
2. **ENV checks**: Multiple `getenv()` calls inlined (even with TLS caching); see the sketch after this list
3. **Debug code**: Not gated properly with `#if !HAKMEM_BUILD_RELEASE`
4. **Metrics**: Frontend metrics tracking (`front_metrics_*`) adds 50-100 instructions
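To illustrate point 2 above, a minimal sketch of the difference between a per-call ENV check and an init-time cached flag; `g_fastcache_enable` appears in the current code, while the `HAKMEM_TINY_FASTCACHE` flag name and helper names are illustrative assumptions:
```c
#include <stdlib.h>

/* Anti-pattern (sketch): every allocation re-checks the flag, pulling getenv()
 * call sites and extra branches into the inlined fast path. */
static inline int fastcache_enabled_slow(void) {
    const char* e = getenv("HAKMEM_TINY_FASTCACHE");  /* illustrative flag name */
    return e && e[0] == '1';
}

/* Preferred (sketch): parse once at init time, then the fast path reads a
 * plain global with a single well-predicted test. */
static int g_fastcache_enable = 0;

static void tiny_env_flags_init(void) {
    const char* e = getenv("HAKMEM_TINY_FASTCACHE");
    g_fastcache_enable = (e && e[0] == '1');
}
```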
### Problem 3: File Organization Chaos
**`hakmem_tiny.c`** (2228 lines):
- Lines 1-500: Global state, TLS variables, initialization
- Lines 500-1000: TLS operations (refill, spill, bind)
- Lines 1000-1500: SuperSlab management
- Lines 1500-2000: Registry operations, slab management
- Lines 2000-2228: Statistics, lifecycle, API wrappers
**Problems**:
- No clear separation of concerns
- Mix of hot path (refill) and cold path (init, stats)
- Circular dependencies between files via `#include`
---
## Refactoring Plan: 3-Phase Approach
### Phase 1: Identify and Remove Dead Features (Priority 1, Quick Win)
**Goal**: Remove experimental features that are disabled or have negative performance impact.
**Actions**:
1. **Audit ENV flags** (1 hour):
```bash
grep -r "getenv.*HAKMEM_TINY" core/ | cut -d: -f2 | sort -u > env_flags.txt
# Identify which are:
# - Always disabled (default=0, never used)
# - Negative performance (A/B test showed regression)
# - Redundant (overlapping with better features)
```
2. **Remove confirmed-dead features** (2 hours):
- **UltraHot** (Phase 19-4): ENV default OFF, adds 11.7% overhead → DELETE
- **HeapV2** (Phase 13-A): ENV gated, overlaps with Ring Cache → DELETE
- **Front C23**: Redundant with Ring Cache → DELETE
- **FastCache**: Overlaps with SFC → CONSOLIDATE into SFC
3. **Simplify to 3-layer hierarchy** (result below; see the Unified Cache sketch after the impact estimate):
```
Layer 0: Unified Cache (tcache-style, all classes C0-C7)
Layer 1: TLS SLL (unlimited overflow)
Layer 2: SuperSlab backend (refill source)
```
**Expected impact**: -30-40% assembly size, +10-15% performance
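For reference, a minimal sketch of the Layer 0 state under this plan: a per-thread, per-class free-list head array in the tcache style. `g_unified_cache` is named in the Phase 2 sketch below; the count array, `TINY_NUM_CLASSES`, and the helper names are illustrative assumptions:
```c
#include <stdint.h>

#define TINY_NUM_CLASSES 8  /* C0-C7; illustrative constant */

/* Per-thread Unified Cache: one singly linked free list per size class.
 * The first word of each cached block stores the "next" pointer, so a hit
 * is a single load/compare/store sequence. */
static __thread void*    g_unified_cache[TINY_NUM_CLASSES];        /* list heads */
static __thread uint16_t g_unified_cache_count[TINY_NUM_CLASSES];  /* for spill limits */

static inline void* unified_cache_pop(int class_idx) {
    void* ptr = g_unified_cache[class_idx];
    if (ptr) {
        g_unified_cache[class_idx] = *(void**)ptr;  /* unlink head */
        g_unified_cache_count[class_idx]--;
    }
    return ptr;
}

static inline void unified_cache_push(int class_idx, void* ptr) {
    *(void**)ptr = g_unified_cache[class_idx];      /* store old head in the block */
    g_unified_cache[class_idx] = ptr;
    g_unified_cache_count[class_idx]++;
}
```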
---
### Phase 2: Extract Hot Path to Separate File (Priority 1, Critical)
**Goal**: Create ultra-simple fast path with zero cold code.
**File split**:
```
core/tiny_alloc_fast.inc.h (885 lines) → split into:
  ├── core/tiny_alloc_ultra.inc.h     (50-100 lines, HOT PATH ONLY)
  ├── core/tiny_alloc_refill.inc.h    (200-300 lines, refill logic)
  ├── core/tiny_alloc_frontend.inc.h  (300-400 lines, frontend layers)
  └── core/tiny_alloc_metrics.inc.h   (100-150 lines, debug/stats)
```
**`tiny_alloc_ultra.inc.h`** (NEW, ultra-simple):
```c
// Ultra-fast path: 10-20 instructions, no branches except miss
static inline void* tiny_alloc_ultra(int class_idx) {
    // Layer 0: Unified Cache (single TLS array)
    void* ptr = unified_cache_pop(class_idx);
    if (__builtin_expect(ptr != NULL, 1)) {
        // Fast hit: 3-4 instructions
        HAK_RET_ALLOC(class_idx, ptr);
    }
    // Layer 1: TLS SLL (overflow)
    ptr = tls_sll_pop(class_idx);
    if (ptr) {
        HAK_RET_ALLOC(class_idx, ptr);
    }
    // Miss: delegate to refill (cold path, kept out-of-line; see the sketch below)
    return tiny_alloc_refill_slow(class_idx);
}
```
**Expected assembly**:
```asm
tiny_alloc_ultra:
    ; ~15-20 instructions total
    mov  rax, [g_unified_cache + class*8]  ; Load cache head
    test rax, rax                          ; Check NULL
    je   .try_sll                          ; Branch on miss
    mov  rdx, [rax]                        ; Load next
    mov  [g_unified_cache + class*8], rdx  ; Update head
    mov  byte [rax], HEADER_MAGIC | class  ; Write header
    lea  rax, [rax + 1]                    ; USER = BASE + 1
    ret                                    ; Return
.try_sll:
    call tls_sll_pop                       ; Try TLS SLL
    test rax, rax
    jne  .sll_hit
    call tiny_alloc_refill_slow            ; Cold path (out-of-line)
    ret
.sll_hit:
    mov  byte [rax], HEADER_MAGIC | class
    lea  rax, [rax + 1]
    ret
```
**Expected impact**: ~20-30 instructions (from 2624), +200-300% performance
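To keep the miss path out of the fast path's instruction footprint, `tiny_alloc_refill_slow()` would live in `tiny_alloc_refill.inc.h` and be explicitly marked non-inlinable and cold. A minimal sketch under that assumption; `tls_refill()` is taken from the module plan below, but its exact signature and the `tiny_class_to_size()` helper are assumptions:
```c
// tiny_alloc_refill.inc.h (sketch): never inlined, placed in a cold section so
// it does not share I-cache lines with tiny_alloc_ultra().
__attribute__((noinline, cold))
static void* tiny_alloc_refill_slow(int class_idx) {
    // 1. Refill the Unified Cache from the TLS SLL / SuperSlab backend
    //    (existing refill logic moves here).
    if (tls_refill(class_idx)) {
        void* ptr = unified_cache_pop(class_idx);
        if (ptr) {
            HAK_RET_ALLOC(class_idx, ptr);
        }
    }
    // 2. Backend exhausted for this class: fall back to the full slow path.
    return hak_tiny_alloc_slow(tiny_class_to_size(class_idx), class_idx);
}
```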
---
### Phase 3: Refactor `hakmem_tiny.c` into Modules (Priority 2, Maintainability)
**Goal**: Split 2228-line monolith into focused, testable modules.
**File structure** (new):
```
core/
├── hakmem_tiny.c (300-400 lines, main API only)
├── tiny_state.c (200-300 lines, global state)
├── tiny_tls.c (300-400 lines, TLS operations)
├── tiny_superslab.c (400-500 lines, SuperSlab backend)
├── tiny_registry.c (200-300 lines, slab registry)
├── tiny_lifecycle.c (200-300 lines, init/shutdown)
├── tiny_stats.c (200-300 lines, statistics)
└── tiny_alloc_ultra.inc.h (50-100 lines, FAST PATH)
```
**Module responsibilities**:
1. **`hakmem_tiny.c`** (300-400 lines):
- Public API: `hak_tiny_alloc()`, `hak_tiny_free()`
- Wrapper functions only
- Include order: `tiny_alloc_ultra.inc.h` → fast path inline (see the wrapper sketch after this list)
2. **`tiny_state.c`** (200-300 lines):
- Global variables: `g_tiny_pool`, `g_tls_sll_head[]`, etc.
- ENV flag parsing (init-time only)
- Configuration structures
3. **`tiny_tls.c`** (300-400 lines):
- TLS operations: `tls_refill()`, `tls_spill()`, `tls_bind()`
- TLS cache management
- Adaptive sizing logic
4. **`tiny_superslab.c`** (400-500 lines):
- SuperSlab allocation: `superslab_refill()`, `superslab_alloc()`
- Slab metadata management
- Active block tracking
5. **`tiny_registry.c`** (200-300 lines):
- Slab registry: `registry_lookup()`, `registry_register()`
- Hash table operations
- Owner slab lookup
6. **`tiny_lifecycle.c`** (200-300 lines):
- Initialization: `hak_tiny_init()`
- Shutdown: `hak_tiny_shutdown()`
- Prewarm: `hak_tiny_prewarm_tls_cache()`
7. **`tiny_stats.c`** (200-300 lines):
- Statistics collection
- Debug counters
- Metrics printing
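As a rough illustration of module 1, the trimmed `hakmem_tiny.c` reduces to thin wrappers around the inlined fast path. A minimal sketch; `tiny_size_to_class()` is an assumed name for the existing size-to-class mapping:
```c
/* hakmem_tiny.c (sketch): public API wrappers only. */
#include <stddef.h>
#include "tiny_alloc_ultra.inc.h"   /* ultra-fast path, inlined here */

void* hak_tiny_alloc(size_t size) {
    int class_idx = tiny_size_to_class(size);  /* assumed helper: size -> C0..C7, or -1 */
    if (class_idx < 0) {
        return NULL;  /* not a tiny size; caller routes to another allocator */
    }
    return tiny_alloc_ultra(class_idx);
}
```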
**Benefits**:
- Each file < 500 lines (maintainable)
- Clear dependencies (no circular includes)
- Testable in isolation
- Parallel compilation
---
## Priority Order & Estimated Impact
### Priority 1: Quick Wins (1-2 days)
**Task 1.1**: Remove dead features (2 hours)
- Delete UltraHot, HeapV2, Front C23
- Remove ENV checks for disabled features
- **Impact**: -30% assembly, +10% performance
**Task 1.2**: Extract ultra-fast path (4 hours)
- Create `tiny_alloc_ultra.inc.h` (50 lines)
- Move refill logic to separate file
- **Impact**: -90% assembly (2624 → 200 lines), +150-200% performance
**Task 1.3**: Remove debug code from release builds (2 hours)
- Gate all `fprintf()` with `#if !HAKMEM_BUILD_RELEASE` (pattern sketched below)
- Remove profiling counters in release
- **Impact**: -10% assembly, +5-10% performance
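A minimal sketch of the gating pattern; `HAK_DBG_LOG` is an illustrative macro name, not an existing identifier, chosen so call sites stay readable while compiling to nothing in release builds:
```c
#include <stdio.h>

#if !HAKMEM_BUILD_RELEASE
  /* Debug builds: forward to stderr. */
  #define HAK_DBG_LOG(...) fprintf(stderr, __VA_ARGS__)
#else
  /* Release builds: expands to nothing, so no fprintf calls or format
   * strings remain in the hot path. */
  #define HAK_DBG_LOG(...) ((void)0)
#endif

/* Hypothetical call site: */
static void refill_debug_trace(int class_idx) {
    HAK_DBG_LOG("[tiny] refill class=%d\n", class_idx);
}
```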
**Expected total (Priority 1)**: 23.6M → 60-80M ops/s (+150-240%)
---
### Priority 2: Code Health (2-3 days)
**Task 2.1**: Split `hakmem_tiny.c` (1 day)
- Extract modules as described above
- Fix include dependencies
- **Impact**: Maintainability only (no performance change)
**Task 2.2**: Simplify frontend to 2 layers (1 day)
- Unified Cache (Layer 0) + TLS SLL (Layer 1)
- Remove redundant Ring/SFC/FastCache
- **Impact**: -5-10% assembly, +5-10% performance
**Task 2.3**: Documentation (0.5 day)
- Document new architecture in `ARCHITECTURE.md`
- Add performance benchmarks
- **Impact**: Team velocity +20%
---
### Priority 3: Advanced Optimization (3-5 days, optional)
**Task 3.1**: Profile-guided optimization
- Collect PGO data from benchmarks
- Recompile with `-fprofile-use`
- **Impact**: +10-20% performance
**Task 3.2**: Assembly-level tuning
- Hand-optimize critical sections
- Align hot paths to cache lines (see the attribute sketch below)
- **Impact**: +5-10% performance
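One way to express the cache-line alignment in C is an attribute on the out-of-line entry point (a sketch; GCC/Clang also accept a build-wide `-falign-functions=64`):
```c
#include <stddef.h>

/* Ask the compiler to keep the exported fast-path entry hot and start it on a
 * 64-byte boundary, so the hot code begins at a fresh I-cache line.
 * (tiny_alloc_ultra itself is inlined; the attribute goes on its caller.) */
__attribute__((hot, aligned(64)))
void* hak_tiny_alloc(size_t size);
```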
---
## Recommended Implementation Order
**Week 1** (Priority 1 - Quick Wins):
1. **Day 1**: Remove dead features + create `tiny_alloc_ultra.inc.h`
2. **Day 2**: Test + benchmark + iterate
**Week 2** (Priority 2 - Code Health):
3. **Day 3-4**: Split `hakmem_tiny.c` into modules
4. **Day 5**: Simplify frontend layers
**Week 3** (Priority 3 - Optional):
5. **Day 6-7**: PGO + assembly tuning
---
## Expected Performance Results
### Current (baseline):
- Performance: 23.6M ops/s
- Assembly: 2624 lines
- L1 misses: 1.98 miss/op
### After Priority 1 (Quick Wins):
- Performance: 60-80M ops/s (+150-240%)
- Assembly: 150-200 lines (-92%)
- L1 misses: 0.4-0.6 miss/op (-70%)
### After Priority 2 (Code Health):
- Performance: 70-90M ops/s (+200-280%)
- Assembly: 100-150 lines (-94%)
- L1 misses: 0.2-0.4 miss/op (-80%)
- Maintainability: Much improved
### Target (System malloc parity):
- Performance: 92.6M ops/s (System malloc baseline)
- Assembly: 50-100 lines (tcache equivalent)
- L1 misses: 0.17 miss/op (System malloc level)
---
## Risk Assessment
### Low Risk:
- Removing disabled features (UltraHot, HeapV2, Front C23)
- Extracting fast path to separate file
- Gating debug code with `#if !HAKMEM_BUILD_RELEASE`
### Medium Risk:
- Simplifying frontend from 11 layers → 2 layers
- **Mitigation**: Keep Ring Cache as fallback during transition
- **A/B test**: Toggle via `HAKMEM_TINY_UNIFIED_ONLY=1`
### High Risk:
- Splitting `hakmem_tiny.c` (circular dependencies)
- **Mitigation**: Incremental extraction, one module at a time
- **Test**: Ensure all benchmarks pass after each extraction
---
## Conclusion
The current architecture suffers from **feature accumulation disease**: 11 experimental frontend layers competing in the same allocation path, creating massive instruction bloat (2624 lines of assembly). The solution is aggressive simplification:
1. **Remove dead/redundant features** (11 layers → 2 layers)
2. **Extract ultra-fast path** (2624 asm lines → 100-150 lines)
3. **Split monolithic file** (2228 lines → 7 focused modules)
**Expected outcome**: 3-4x performance improvement (23.6M → 70-90M ops/s), approaching System malloc parity (92.6M ops/s).
**Recommended action**: Start with Priority 1 tasks (1-2 days), which deliver 80% of the benefit with minimal risk.