# Phase 8 - Technical Analysis and Root Cause Investigation

## Executive Summary

Phase 8 comprehensive benchmarking reveals **critical performance issues** with HAKMEM:

- **Working Set 256 (Hot Cache)**: 9.4% slower than System malloc, 45.2% slower than mimalloc
- **Working Set 8192 (Realistic)**: **246% slower than System malloc, 485% slower than mimalloc**

The most alarming finding: HAKMEM suffers a **4.8x performance degradation** when moving from the hot-cache workload to the realistic one, compared to only 1.5x for System malloc and 1.2x for mimalloc.

## Benchmark Results Summary

### Working Set 256 (Hot Cache)

| Allocator      | Avg (M ops/s) | StdDev | vs HAKMEM |
|----------------|---------------|--------|-----------|
| HAKMEM Phase 8 | 79.2          | ±2.4%  | 1.00x     |
| System malloc  | 86.7          | ±1.0%  | 1.09x     |
| mimalloc       | 114.9         | ±1.2%  | 1.45x     |

### Working Set 8192 (Realistic Workload)

| Allocator      | Avg (M ops/s) | StdDev | vs HAKMEM |
|----------------|---------------|--------|-----------|
| HAKMEM Phase 8 | 16.5          | ±2.5%  | 1.00x     |
| System malloc  | 57.1          | ±1.3%  | 3.46x     |
| mimalloc       | 96.5          | ±0.9%  | 5.85x     |

### Scalability Analysis

Performance degradation from WS256 → WS8192:

- **HAKMEM**: 4.80x slowdown (79.2 → 16.5 M ops/s)
- **System**: 1.52x slowdown (86.7 → 57.1 M ops/s)
- **mimalloc**: 1.19x slowdown (114.9 → 96.5 M ops/s)

**HAKMEM degrades 3.16x MORE than System malloc and 4.03x MORE than mimalloc.**

## Root Cause Analysis

### Evidence from Debug Logs

The benchmark output shows critical issues:

```
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
```

**Analysis**: The repeated "shared_fail→legacy" messages indicate SuperSlab exhaustion, forcing fallback to the legacy allocator path. This happens **4 times** during the WS8192 benchmark, suggesting severe SuperSlab fragmentation or capacity issues.

### Issue 1: SuperSlab Architecture Doesn't Scale

**Symptoms**:

- Performance collapses from 79.2 to 16.5 M ops/s (4.8x degradation)
- Shared SuperSlabs fail repeatedly
- TLS_SLL_HDR_RESET events occur (slab header corruption?)

**Root Causes (Hypotheses)**:

1. **SuperSlab Capacity**: Current 512KB SuperSlabs may be too small for WS8192 (rough arithmetic after this list)
   - 8192 objects × 16-1024 bytes each ≈ a 4-8MB working set
   - Multiple SuperSlabs needed → increased lookup overhead

2. **Fragmentation**: SuperSlabs become fragmented with larger working sets
   - Free slots scattered across multiple SuperSlabs
   - Linear search through slab list becomes expensive

3. **TLB Pressure**: More SuperSlabs = more page table entries
   - System malloc uses fewer, larger arenas
   - HAKMEM's 512KB slabs create more TLB misses

4. **Cache Pollution**: Slab metadata pollutes L1/L2 cache
   - Each SuperSlab has metadata overhead
   - More slabs = more metadata = less cache for actual data

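To make the capacity hypothesis concrete, here is the rough arithmetic referenced in item 1, using only the 512KB SuperSlab size and the benchmark working-set counts (per-slab metadata ignored, one size class per SuperSlab assumed):

```
512 KiB SuperSlab,  256 B class : ~2048 slots/slab → 8192 objects ≈  4 slabs
512 KiB SuperSlab, 1024 B class :  ~512 slots/slab → 8192 objects ≈ 16 slabs

WS256  (~0.13 MB total) : fits comfortably in a single SuperSlab per class
WS8192 (~4-8 MB total)  : on the order of 8-16 live SuperSlabs at once
```

If each allocation may have to walk that many slabs, or fall through to the shared/legacy path, the 4.8x collapse is at least plausible.
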
### Issue 2: TLS Drain Overhead

Debug logs show:

```
[TLS_SLL_DRAIN] Drain ENABLED (default)
[TLS_SLL_DRAIN] Interval=2048 (default)
```

**Analysis**: Even in the hot-cache case (WS256), HAKMEM is 9.4% slower than System malloc. This points to fast-path overhead from the TLS drain bookkeeping: the interval counter is checked on every operation, and a drain fires every 2048 operations (a minimal sketch of such a check follows the list below).

**Evidence**:

- WS256 should fit entirely in cache, yet HAKMEM still lags
- System malloc has a simpler fast path (no drain logic)
- At 3.5 GHz, the 9.4% gap works out to roughly 4 extra cycles per operation (~44 vs ~40 cycles/op)

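To illustrate where those cycles could come from, here is a minimal sketch of a TLS fast path with a drain-interval check. All names are hypothetical (this is not HAKMEM's actual code); only the 2048-operation interval is taken from the logs:

```c
/* Hypothetical sketch, not HAKMEM's real API: a per-thread free list with a
 * drain check on every allocation, mirroring the logged Interval=2048. */
#include <stddef.h>
#include <stdlib.h>

#define NUM_CLASSES    8
#define DRAIN_INTERVAL 2048

static __thread unsigned long tls_op_count;
static __thread void *tls_free_list[NUM_CLASSES];

static void tls_drain(int cls) { (void)cls; /* return surplus blocks to shared slabs */ }
static void *refill_from_superslab(int cls) { (void)cls; return malloc(64); /* stand-in */ }

static void *hak_alloc_fast(int cls) {
    /* This counter update and branch run on EVERY allocation, even when no
     * drain fires: a few cycles and a branch-predictor slot per call. */
    if (__builtin_expect(++tls_op_count % DRAIN_INTERVAL == 0, 0))
        tls_drain(cls);

    void *p = tls_free_list[cls];
    if (__builtin_expect(p != NULL, 1)) {
        tls_free_list[cls] = *(void **)p;   /* pop the singly-linked free list */
        return p;
    }
    return refill_from_superslab(cls);      /* miss: go to the backend */
}
```

Batching or removing that per-operation check (see Gap 1 below) is the cheapest of the proposed fixes to test.
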
### Issue 3: TLS_SLL_HDR_RESET Events

```
[TLS_SLL_HDR_RESET] cls=6 base=0x790999b35a0e got=0x00 expect=0xa6 count=0
```

**Analysis**: Header reset events suggest slab list corruption or validation failures. This shouldn't happen in normal operation and indicates potential race conditions or memory corruption.
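The log format itself hints at what the check is: `expect=0xa6` for `cls=6` looks like a per-class tag byte (e.g. `0xa0 | cls`). Purely as a speculative reconstruction (not HAKMEM's actual code), the check producing that line would look something like this, and its recovery path throws away the cached chain:

```c
#include <stdint.h>
#include <stdio.h>

/* Speculative reconstruction: each TLS free-list block is assumed to start
 * with a one-byte class tag; 0xa0 | cls matches expect=0xa6 for cls=6. */
#define HDR_TAG(cls) ((uint8_t)(0xa0 | (cls)))

/* Returns 1 if the list head looks sane; otherwise logs, resets, returns 0. */
static int tls_sll_validate(void **list_head, int cls, unsigned count) {
    uint8_t *base = (uint8_t *)*list_head;
    if (base == NULL) return 1;
    if (base[0] == HDR_TAG(cls)) return 1;
    fprintf(stderr,
            "[TLS_SLL_HDR_RESET] cls=%d base=%p got=0x%02x expect=0x%02x count=%u\n",
            cls, (void *)base, base[0], HDR_TAG(cls), count);
    *list_head = NULL;   /* drop the whole chain: safe, but leaks cached blocks */
    return 0;
}
```

If that interpretation is right, every reset both signals corruption and discards useful cached memory, which would compound the backend pressure described in Issue 1.
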
## Performance Breakdown

### Where HAKMEM Loses Performance (WS8192)

Estimated cycle budget (assuming a 3.5 GHz CPU; the conversion is shown after the list):

- **HAKMEM**: 16.5 M ops/s = ~212 cycles/operation
- **System**: 57.1 M ops/s = ~61 cycles/operation
- **mimalloc**: 96.5 M ops/s = ~36 cycles/operation

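These are straight conversions from the measured throughput (cycles/op ≈ clock rate / ops per second):

```
3.5e9 / 16.5e6 ≈ 212 cycles/op   (HAKMEM)
3.5e9 / 57.1e6 ≈  61 cycles/op   (System)
3.5e9 / 96.5e6 ≈  36 cycles/op   (mimalloc)
```
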
**Gap Analysis**:

- HAKMEM spends **151 extra cycles** per operation vs System malloc
- HAKMEM spends **176 extra cycles** per operation vs mimalloc

Where do these cycles go?

1. **SuperSlab Lookup** (~50-80 cycles)
   - Linear search through the slab list
   - Cache misses on slab metadata
   - TLB misses on slab pages

2. **TLS Drain Logic** (~10-15 cycles)
   - Drain counter checked on every allocation
   - Branch mispredictions

3. **Fragmentation Overhead** (~30-50 cycles)
   - Walking free lists
   - Finding suitable free blocks

4. **Legacy Fallback** (~50-100 cycles when triggered)
   - System malloc/mmap calls
   - Syscall entry and page-fault costs

## Competitive Analysis

### Why System malloc Wins (3.46x faster)

1. **Arena-based design**: Fewer, larger memory regions
2. **Thread caching**: Similar to HAKMEM TLS but better tuned
3. **Mature optimization**: Decades of tuning
4. **Simple fast path**: No drain logic, no SuperSlab lookup

### Why mimalloc Dominates (5.85x faster)

1. **Segment-based design**: Optimal for multi-threaded workloads
2. **Free list sharding**: Reduces contention
3. **Aggressive inlining**: Fast path is 15-20 instructions
4. **No locks in fast path**: Lock-free for thread-local allocations
5. **Delayed freeing**: Like HAKMEM drain but more efficient
6. **Minimal metadata**: Less cache pollution

## Critical Gaps to Address

### Gap 1: Fast Path Performance (9.4% slower at WS256)

**Target**: Match System malloc on the hot-cache workload

**Required improvement**: +9.4% ≈ +7.5 M ops/s

**Action items**:

- Profile TLS drain overhead
- Inline critical functions more aggressively
- Reduce branch mispredictions
- Consider removing drain logic or making it lazy

### Gap 2: Scalability (246% slower at WS8192)

**Target**: Get within 20% of System malloc at the realistic workload

**Required improvement**: +40.6 M ops/s (a 3.46x speedup) to fully match System malloc; roughly +29 M ops/s (2.8x) to reach the 80% target

**Action items**:

- Fix SuperSlab scaling
- Reduce fragmentation
- Optimize SuperSlab lookup (hash table or pointer masking instead of linear search; see the sketch after this list)
- Reduce TLB pressure (larger SuperSlabs or better placement)
- Profile cache misses

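For the lookup item above, one option worth prototyping is the approach mimalloc-style allocators use: if every SuperSlab is mapped at an address aligned to its own size, the owning slab of any pointer falls out of a mask, with no list walk and no hash table. A minimal sketch, assuming a hypothetical `SuperSlab` header at the start of each slab (not HAKMEM's actual layout):

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>

#define SS_SIZE ((size_t)512 * 1024)        /* current SuperSlab size */

typedef struct SuperSlab {
    uint32_t magic;                          /* sanity tag */
    uint32_t cls;                            /* size class served */
    /* ... free-list metadata ... */
} SuperSlab;

/* Over-reserve, then use an SS_SIZE-aligned base inside the mapping. */
static SuperSlab *superslab_create(void) {
    void *raw = mmap(NULL, 2 * SS_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return NULL;
    uintptr_t base = ((uintptr_t)raw + SS_SIZE - 1) & ~(uintptr_t)(SS_SIZE - 1);
    /* (the unused head/tail of the 2x reservation could be munmap'd here) */
    return (SuperSlab *)base;
}

/* O(1) owner lookup: replaces the linear slab-list search entirely. */
static inline SuperSlab *superslab_of(const void *p) {
    return (SuperSlab *)((uintptr_t)p & ~(uintptr_t)(SS_SIZE - 1));
}
```

The trade-off is address-space waste from the aligned reservation, and it interacts with the TLB-pressure concern, so it should be tested together with the 1-4MB slab-size experiments proposed for Phase 9.
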
## Recommendations for Phase 9+

### Phase 9: CRITICAL - SuperSlab Investigation

**Goal**: Understand why SuperSlab performance collapses at WS8192

**Tasks**:

1. Add detailed profiling:
   - SuperSlab lookup latency distribution
   - Cache miss rates (L1, L2, L3)
   - TLB miss rates
   - Fragmentation metrics

2. Measure SuperSlab statistics (a counter sketch follows this list):
   - Number of active SuperSlabs at WS256 vs WS8192
   - Average slab list length
   - Hit rate for first-slab lookup

3. Experiment with SuperSlab sizes:
   - Try 1MB, 2MB, 4MB SuperSlabs
   - Measure impact on performance

4. Analyze "shared_fail→legacy" events:
   - Why do shared slabs fail?
   - How often does it happen?
   - Can we pre-allocate more capacity?

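For task 2, a sketch of the kind of counters that would answer those questions directly. The names are made up (this is not existing HAKMEM code); the increments would live in the backend lookup and fallback paths:

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical instrumentation, not existing HAKMEM code. */
typedef struct {
    atomic_ulong active_superslabs;   /* currently mapped SuperSlabs */
    atomic_ulong lookup_calls;        /* backend lookups */
    atomic_ulong lookup_first_hit;    /* satisfied by the first slab tried */
    atomic_ulong lookup_steps;        /* slabs walked in total */
    atomic_ulong shared_fail_legacy;  /* "shared_fail→legacy" fallbacks */
} ss_stats_t;

static ss_stats_t g_ss_stats;

/* Dump once per benchmark phase to compare WS256 vs WS8192. */
static void ss_stats_dump(void) {
    unsigned long calls = atomic_load(&g_ss_stats.lookup_calls);
    fprintf(stderr,
            "[SS_STATS] slabs=%lu lookups=%lu first_hit=%.1f%% avg_walk=%.2f legacy=%lu\n",
            atomic_load(&g_ss_stats.active_superslabs),
            calls,
            calls ? 100.0 * atomic_load(&g_ss_stats.lookup_first_hit) / calls : 0.0,
            calls ? (double)atomic_load(&g_ss_stats.lookup_steps) / calls : 0.0,
            atomic_load(&g_ss_stats.shared_fail_legacy));
}
```

Comparing these numbers at WS256 and WS8192 would confirm or eliminate hypotheses 1 and 2 before any redesign work starts.
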
### Phase 10: Fast Path Optimization

**Goal**: Close the 9.4% gap at WS256

**Tasks**:

1. Profile TLS drain overhead
2. Experiment with drain intervals (4096, 8192, disabled; see the sketch after this list)
3. Inline more aggressively
4. Add `__builtin_expect` hints for common paths
5. Reduce branch mispredictions

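The interval sweep in item 2 is easiest if the interval is a runtime knob rather than a rebuild. The logs already print the default, which suggests such a knob may exist; if not, a minimal sketch (the environment variable name here is invented for illustration, not a documented HAKMEM option):

```c
#include <stdlib.h>

/* Hypothetical runtime knob; HAKMEM_TLS_DRAIN_INTERVAL is not a documented
 * option, only the shape such a knob could take. */
static unsigned long g_drain_interval = 2048;   /* logged default */
static int           g_drain_enabled  = 1;

static void drain_config_init(void) {
    const char *s = getenv("HAKMEM_TLS_DRAIN_INTERVAL");
    if (s == NULL) return;
    unsigned long v = strtoul(s, NULL, 10);
    if (v == 0)
        g_drain_enabled = 0;        /* the "disabled" case in the sweep */
    else
        g_drain_interval = v;
}
```

Each configuration then becomes an environment change in the benchmark driver rather than a rebuild.
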
### Phase 11: Architecture Re-evaluation

**Goal**: Decide if SuperSlab model is viable

**Decision point**: If Phase 9 can't get within 50% of System malloc at WS8192, consider:

1. **Hybrid approach**: TLS fast path + different backend (jemalloc-style arenas?)
2. **Abandon SuperSlab**: Switch to segment-based design like mimalloc
3. **Radical simplification**: Focus on specific use case (small allocations only?)

## Success Criteria for Phase 9

Minimum acceptable improvements:

- WS256: 79.2 → 85+ M ops/s (+7%, roughly matching System malloc)
- WS8192: 16.5 → 35+ M ops/s (+112%, comfortably above 50% of System malloc)

Stretch goals:

- WS256: 90+ M ops/s (at or above System malloc)
- WS8192: 45+ M ops/s (~80% of System malloc)

## Raw Data

All benchmark runs completed successfully with good statistical stability (StdDev < 2.5%).

### Working Set 256

```
HAKMEM:   [78.5, 78.1, 77.0, 81.1, 81.2] M ops/s
System:   [87.3, 86.5, 87.5, 85.3, 86.6] M ops/s
mimalloc: [115.8, 115.2, 116.2, 112.5, 115.0] M ops/s
```

### Working Set 8192

```
HAKMEM:   [16.5, 15.8, 16.9, 16.7, 16.6] M ops/s
System:   [56.1, 57.8, 57.0, 57.7, 56.7] M ops/s
mimalloc: [96.8, 96.1, 95.5, 97.7, 96.3] M ops/s
```

## Conclusion

Phase 8 benchmarking reveals fundamental issues with HAKMEM's current architecture:

1. **SuperSlab scaling is broken** - 4.8x performance degradation is unacceptable
2. **Fast path has overhead** - Even hot cache shows 9.4% gap
3. **Competition is fierce** - mimalloc is 5.85x faster at realistic workloads

**Next priority**: Phase 9 MUST focus on understanding and fixing SuperSlab scalability. Without addressing this core issue, HAKMEM cannot compete with production allocators.

The benchmark data is statistically robust (low variance) and reproducible. The performance gaps are real and significant.