# Phase 6-A Benchmark Results
**Date**: 2025-11-29
**Change**: Disable SuperSlab lookup debug validation in RELEASE builds
**File**: `core/tiny_region_id.h:199-239`
**Guard**: `#if !HAKMEM_BUILD_RELEASE` around `hak_super_lookup()` call
**Reason**: perf profiling showed 15.84% CPU cost on allocation hot path (debug-only validation)
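For reference, the change amounts to wrapping the validation call in a build-mode guard. A minimal sketch of the pattern (only `HAKMEM_BUILD_RELEASE` and `hak_super_lookup()` are taken from the actual change; the wrapper function, the lookup signature, and the assert are illustrative stand-ins, not the real code in `core/tiny_region_id.h`):
```c
/* Sketch of the Phase 6-A guard pattern (illustrative, not hakmem source). */
#include <assert.h>

void* hak_super_lookup(void* ptr);    /* real declaration lives in hakmem headers */

static inline void write_header_sketch(void* ptr)
{
#if !HAKMEM_BUILD_RELEASE
    /* Debug-only validation: confirm ptr belongs to a registered SuperSlab.
     * The whole lookup disappears when HAKMEM_BUILD_RELEASE=1. */
    void* ss = hak_super_lookup(ptr);
    assert(ss != NULL);
#else
    (void)ptr;                        /* release build: no validation cost */
#endif
}
```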
---
## Executive Summary
Phase 6-A implementation successfully removes debug validation overhead in release builds, but the measured performance impact is **significantly smaller** than predicted:
- **Expected**: +12-15% (random_mixed), +8-10% (mid_mt_gap)
- **Actual (best 3 of 5)**: +1.67% (random_mixed), +1.33% (mid_mt_gap)
- **Actual (excluding warmup)**: +4.07% (random_mixed), +1.97% (mid_mt_gap)
**Recommendation**: HOLD on commit. Investigate discrepancy between perf analysis (15.84% CPU) and benchmark results (~1-4% improvement).
---
## Benchmark Configuration
### Build Configurations
#### Baseline (Before Phase 6-A)
```bash
make clean
make EXTRA_CFLAGS="-g -O3" bench_random_mixed_hakmem bench_mid_mt_gap_hakmem
# Note: Makefile sets -DHAKMEM_BUILD_RELEASE=1 by default
# Result: SuperSlab lookup ALWAYS enabled (no guard in code yet)
```
#### Phase 6-A (After)
```bash
git stash pop # Restore Phase 6-A changes
make clean
make EXTRA_CFLAGS="-g -O3" bench_random_mixed_hakmem bench_mid_mt_gap_hakmem
# Note: Makefile sets -DHAKMEM_BUILD_RELEASE=1 by default
# Result: SuperSlab lookup DISABLED (guarded by #if !HAKMEM_BUILD_RELEASE)
```
### Benchmark Parameters
- **Iterations**: 1,000,000 operations per run
- **Working Set**: 256 blocks
- **Seed**: 42 (reproducible)
- **Runs**: 5 per configuration
- **Suppression**: `2>/dev/null` to exclude debug output noise
---
## Raw Results
### bench_random_mixed (Tiny workload, 16B-1KB)
#### Baseline (Before Phase 6-A, SuperSlab lookup ALWAYS enabled)
```
Run 1: 53.81 M ops/s
Run 2: 53.25 M ops/s
Run 3: 53.56 M ops/s
Run 4: 49.41 M ops/s
Run 5: 51.41 M ops/s
Average: 52.29 M ops/s
Stdev: 1.86 M ops/s
```
#### Phase 6-A (Release build, SuperSlab lookup DISABLED)
```
Run 1: 39.11 M ops/s ⚠️ OUTLIER (warmup)
Run 2: 53.30 M ops/s
Run 3: 56.28 M ops/s
Run 4: 52.79 M ops/s
Run 5: 53.72 M ops/s
Average: 51.04 M ops/s (all runs)
Stdev: 6.80 M ops/s (high due to outlier)
Average (excl. Run 1): 54.02 M ops/s
```
**Outlier Analysis**: Run 1 is 27.6% slower than the average of runs 2-5, indicating a warmup/cache-cold issue.
---
### bench_mid_mt_gap (Mid MT workload, 1KB-8KB)
#### Baseline (Before Phase 6-A, SuperSlab lookup ALWAYS enabled)
```
Run 1: 41.70 M ops/s
Run 2: 37.39 M ops/s
Run 3: 40.91 M ops/s
Run 4: 40.53 M ops/s
Run 5: 40.56 M ops/s
Average: 40.22 M ops/s
Stdev: 1.65 M ops/s
```
#### Phase 6-A (Release build, SuperSlab lookup DISABLED)
```
Run 1: 41.49 M ops/s
Run 2: 41.81 M ops/s
Run 3: 41.51 M ops/s
Run 4: 38.43 M ops/s
Run 5: 40.78 M ops/s
Average: 40.80 M ops/s
Stdev: 1.38 M ops/s
```
**Variance Analysis**: Both baseline and Phase 6-A show similar variance (~3-4 M ops/s spread), suggesting measurement noise is inherent to this benchmark.
---
## Statistical Analysis
### Comparison 1: All Runs (Conservative)
| Benchmark | Baseline | Phase 6-A | Absolute | Relative | Expected | Result |
|-----------|----------|-----------|----------|----------|----------|--------|
| random_mixed | 52.29 M | 51.04 M | -1.25 M | **-2.39%** | +12-15% | ❌ FAIL |
| mid_mt_gap | 40.22 M | 40.80 M | +0.59 M | **+1.46%** | +8-10% | ❌ FAIL |
### Comparison 2: Excluding First Run (Warmup Correction)
| Benchmark | Baseline | Phase 6-A | Absolute | Relative | Expected | Result |
|-----------|----------|-----------|----------|----------|----------|--------|
| random_mixed | 51.91 M | 54.02 M | +2.11 M | **+4.07%** | +12-15% | ⚠️ PARTIAL |
| mid_mt_gap | 39.85 M | 40.63 M | +0.78 M | **+1.97%** | +8-10% | ❌ FAIL |
### Comparison 3: Best 3 of 5 (Peak Performance)
| Benchmark | Baseline | Phase 6-A | Absolute | Relative | Expected | Result |
|-----------|----------|-----------|----------|----------|----------|--------|
| random_mixed | 53.54 M | 54.43 M | +0.89 M | **+1.67%** | +12-15% | ❌ FAIL |
| mid_mt_gap | 41.06 M | 41.60 M | +0.54 M | **+1.33%** | +8-10% | ❌ FAIL |
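For reference, the means, sample standard deviations, and relative deltas in these tables follow the usual formulas; a small standalone helper like the one below (not part of the repo) reproduces, for example, the Comparison 1 row for random_mixed from the raw runs:
```c
/* stats.c - reproduce the mean / sample-stdev / relative-delta figures above.
 * Build: gcc -O2 -o stats stats.c -lm   (helper only, not part of hakmem) */
#include <math.h>
#include <stdio.h>

static double mean(const double* v, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += v[i];
    return s / n;
}

static double stdev(const double* v, int n) {          /* sample stdev (n-1) */
    double m = mean(v, n), s = 0.0;
    for (int i = 0; i < n; i++) s += (v[i] - m) * (v[i] - m);
    return sqrt(s / (n - 1));
}

int main(void) {
    const double base[]  = {53.81, 53.25, 53.56, 49.41, 51.41};   /* baseline random_mixed */
    const double phase[] = {39.11, 53.30, 56.28, 52.79, 53.72};   /* Phase 6-A random_mixed */
    double mb = mean(base, 5), mp = mean(phase, 5);
    printf("baseline : %.2f M ops/s (stdev %.2f)\n", mb, stdev(base, 5));
    printf("phase 6-A: %.2f M ops/s (stdev %.2f)\n", mp, stdev(phase, 5));
    printf("relative : %+.2f%%\n", 100.0 * (mp - mb) / mb);
    return 0;
}
```
Swapping in the other run sets reproduces the remaining rows.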
---
## Performance Summary
### Overall Results (Best 3 of 5 method)
- **random_mixed**: 53.54 → 54.43 M ops/s (+1.67%)
- **mid_mt_gap**: 41.06 → 41.60 M ops/s (+1.33%)
### vs Predictions
- **random_mixed**: Expected +12-15%, Actual +1.67% → **FAIL** (roughly 7-9x smaller than expected)
- **mid_mt_gap**: Expected +8-10%, Actual +1.33% → **FAIL** (6-7x smaller than expected)
### Interpretation
Phase 6-A shows **small, practically negligible** performance improvements that sit at or below the benchmarks' noise floor:
- Excluding warmup: +4.07% (random_mixed), +1.97% (mid_mt_gap)
- Best 3 of 5: +1.67% (random_mixed), +1.33% (mid_mt_gap)
- All runs: -2.39% (random_mixed), +1.46% (mid_mt_gap)
The improvements are **roughly an order of magnitude smaller** than expected based on the perf analysis.
---
## Root Cause Analysis
### Why the Discrepancy?
The perf profile showed `hak_super_lookup()` consuming **15.84% of CPU time**, yet removing it yields only **~1-4% improvement**. Possible explanations:
#### 1. **Compiler Optimization (Most Likely)**
The compiler may already be optimizing away the `hak_super_lookup()` call in release builds:
- **Dead Code Elimination**: The result of `hak_super_lookup()` is only consumed by debug logging, so the compiler can drop the call when that logging is disabled
- **Inlining + Constant Propagation**: With LTO, the compiler sees the result is unused
- **Evidence**: Phase 6-A guard has minimal impact, suggesting code was already "free"
**Action**: Examine assembly output to verify whether `hak_super_lookup()` is present in the baseline build
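A toy example (none of this is hakmem code) of how a visible, side-effect-free callee whose result feeds only debug code gets removed entirely at -O3:
```c
/* dce_demo.c - toy illustration of hypothesis #1 (nothing here is hakmem code).
 * With the debug consumer compiled out, GCC/Clang at -O3 typically remove the
 * call entirely when the callee is visible and side-effect free:
 *   gcc -O3 -S dce_demo.c && grep lookup dce_demo.s   # often no trace remains
 */
#include <stdint.h>

static inline uint64_t lookup(uint64_t x) {   /* stand-in for hak_super_lookup() */
    return (x * 2654435761u) >> 7;            /* pure computation, no side effects */
}

uint64_t hot_path(uint64_t p) {
    uint64_t r = lookup(p);                   /* result only consumed by debug code */
#if defined(DEBUG_VALIDATE)
    if (r == 0) __builtin_trap();
#else
    (void)r;                                  /* release build: result is unused ... */
#endif
    return p + 1;                             /* ... so the lookup is dead code */
}
```
If the baseline assembly shows the same behavior for `hak_super_lookup()`, hypothesis #1 holds and the Phase 6-A guard merely makes the elimination explicit.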
#### 2. **Perf Sampling Bias**
The perf profile may have been captured during a different workload phase:
- Different allocation patterns (class distribution)
- Different cache states (cold vs. hot)
- Different thread counts (single vs. multi-threaded)
**Action**: Re-run perf on the exact benchmark workload to verify 15.84% claim
#### 3. **Measurement Noise**
The benchmarks show high variance:
- random_mixed: 1.86 M stdev (3.6% of mean)
- mid_mt_gap: 1.65 M stdev (4.1% of mean)
The measured improvements (+1-4%) are within **1-2 standard deviations** of noise.
**Action**: Run longer benchmarks (10M+ operations) to reduce noise
#### 4. **Lookup Already Cache-Friendly**
The SuperSlab registry lookup may be highly cache-efficient in these workloads:
- Small working set (256 blocks) fits in L1/L2 cache
- Registry entries for active SuperSlabs are hot
- Cost is much lower than perf's 15.84% suggests
**Action**: Benchmark with larger working sets (e.g., 4K+ blocks) to stress the cache
#### 5. **Wrong Hot Path**
The perf profile showed 15.84% CPU in `hak_super_lookup()`, but this may not be on the **allocation hot path** that these benchmarks exercise:
- The call is in `tiny_region_id_write_header()` (allocation)
- Benchmarks mix alloc+free, free path may dominate
- Perf may have sampled during a malloc-heavy phase
**Action**: Isolate allocation-only benchmark (no frees) to verify
---
## Recommendations
### Immediate Actions
1. **HOLD** on committing Phase 6-A until investigation completes
- Current results don't justify the change
- Risk: code churn without measurable benefit
2. **Verify Compiler Behavior**
```bash
# Generate assembly for the baseline build
# (-x c makes gcc compile the header as a C translation unit; extra -I flags
#  may be needed to satisfy its includes)
gcc -S -x c -DHAKMEM_BUILD_RELEASE=1 -O3 -o baseline.s core/tiny_region_id.h
# Check whether hak_super_lookup appears
grep "hak_super_lookup" baseline.s
# If absent: the compiler already eliminated it (explains the minimal improvement)
# If present: something else is going on
# Caveat: if the call was inlined, the symbol may not appear even though the work remains
```
3. **Re-run Perf on Benchmark Workload**
```bash
# Build baseline without Phase 6-A (keep -g so perf can symbolize samples)
git stash
make clean && make EXTRA_CFLAGS="-g -O3" bench_random_mixed_hakmem
# Profile the same benchmark workload (longer run for more samples)
perf record -g ./bench_random_mixed_hakmem 10000000 256 42
perf report --stdio | grep -A20 "hak_super_lookup"
# Verify whether the 15.84% claim holds for this workload
```
4. **Longer Benchmark Runs**
```bash
# 100M operations to reduce noise
for i in 1 2 3 4 5; do
    ./bench_random_mixed_hakmem 100000000 256 42 2>/dev/null
done
```
### Long-Term Considerations
If investigation reveals:
#### Scenario A: Compiler Already Optimized
- **Decision**: Commit Phase 6-A for code cleanliness (no harm, no foul)
- **Rationale**: Explicitly documents debug-only code, prevents future confusion
- **Benefit**: Future-proof if compiler behavior changes
#### Scenario B: Perf Was Wrong
- **Decision**: Discard Phase 6-A, update perf methodology
- **Rationale**: The 15.84% CPU claim was based on flawed profiling
- **Action**: Document correct perf sampling procedure
#### Scenario C: Benchmark Doesn't Stress Hot Path
- **Decision**: Commit Phase 6-A, improve benchmark coverage
- **Rationale**: Real workloads may show the expected gains
- **Action**: Add allocation-heavy benchmark (e.g., 90% malloc, 10% free)
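A rough standalone sketch of such a driver (illustrative only: it uses plain `malloc`/`free` as stand-ins, and the file name, working-set size, and loop structure below are assumptions rather than anything from the repo):
```c
/* alloc_heavy_sketch.c - allocation-heavy driver sketch (illustrative only;
 * a real version would call hakmem's allocator and reuse the existing harness's
 * iteration / working-set / seed parameters).
 * Build: gcc -O3 -o alloc_heavy alloc_heavy_sketch.c */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { WS = 65536 };                         /* max live blocks before a drain */

int main(void) {
    const long iters = 10 * 1000 * 1000;
    static void* slots[WS];
    size_t live = 0;
    uint64_t rng = 42;                       /* xorshift64 state, seed 42 */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
        if (live == WS) {                    /* cap reached: drain half in a burst   */
            while (live > WS / 2) free(slots[--live]);
        }
        if ((rng % 10) != 0 || live == 0) {  /* ~90% of steps: allocate 16B..1KB     */
            slots[live++] = malloc(16 + ((rng >> 8) % 1008));
        } else {                             /* ~10% of steps: free the newest block */
            free(slots[--live]);
        }
    }
    while (live > 0) free(slots[--live]);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.2f M ops/s\n", iters / sec / 1e6);   /* throughput over the step loop */
    return 0;
}
```
Because memory stays bounded, total frees necessarily equal total mallocs over the run; the 90/10 split describes the per-step choice, and the periodic drain absorbs the difference.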
#### Scenario D: Measurement Noise Dominates
- **Decision**: Commit Phase 6-A if longer runs show >5% improvement
- **Rationale**: Noise can hide real improvements
- **Action**: Use mimalloc-bench suite for more stable measurements
---
## Next Steps
### Phase 6-B: Conditional Path Forward
**Option 1: Investigate First (Recommended)**
1. Run assembly analysis (1 hour)
2. Re-run perf on benchmark (2 hours)
3. Run longer benchmarks (4 hours)
4. Make data-driven decision
**Option 2: Commit Anyway**
- Rationale: Code is cleaner, no measurable harm
- Risk: Future confusion if optimization isn't actually needed
**Option 3: Discard Phase 6-A**
- Rationale: No measurable benefit, not worth the churn
- Risk: Miss real optimization if measurement was flawed
---
## Appendix: Full Benchmark Output
### Baseline - bench_random_mixed
```
=== Baseline: bench_random_mixed (Before Phase 6-A, SuperSlab lookup ALWAYS enabled) ===
Run 1: Throughput = 53806309 ops/s [iter=1000000 ws=256] time=0.019s
Run 2: Throughput = 53246568 ops/s [iter=1000000 ws=256] time=0.019s
Run 3: Throughput = 53563123 ops/s [iter=1000000 ws=256] time=0.019s
Run 4: Throughput = 49409566 ops/s [iter=1000000 ws=256] time=0.020s
Run 5: Throughput = 51412515 ops/s [iter=1000000 ws=256] time=0.019s
```
### Phase 6-A - bench_random_mixed
```
=== Phase 6-A: bench_random_mixed (Release build, SuperSlab lookup DISABLED) ===
Run 1: Throughput = 39111201 ops/s [iter=1000000 ws=256] time=0.026s
Run 2: Throughput = 53296242 ops/s [iter=1000000 ws=256] time=0.019s
Run 3: Throughput = 56279982 ops/s [iter=1000000 ws=256] time=0.018s
Run 4: Throughput = 52790754 ops/s [iter=1000000 ws=256] time=0.019s
Run 5: Throughput = 53715992 ops/s [iter=1000000 ws=256] time=0.019s
```
### Baseline - bench_mid_mt_gap
```
=== Baseline: bench_mid_mt_gap (Before Phase 6-A, SuperSlab lookup ALWAYS enabled) ===
Run 1: Throughput = 41.70 M operations per second, relative time: 0.023979 s.
Run 2: Throughput = 37.39 M operations per second, relative time: 0.026745 s.
Run 3: Throughput = 40.91 M operations per second, relative time: 0.024445 s.
Run 4: Throughput = 40.53 M operations per second, relative time: 0.024671 s.
Run 5: Throughput = 40.56 M operations per second, relative time: 0.024657 s.
```
### Phase 6-A - bench_mid_mt_gap
```
=== Phase 6-A: bench_mid_mt_gap (Release build, SuperSlab lookup DISABLED) ===
Run 1: Throughput = 41.49 M operations per second, relative time: 0.024103 s.
Run 2: Throughput = 41.81 M operations per second, relative time: 0.023917 s.
Run 3: Throughput = 41.51 M operations per second, relative time: 0.024089 s.
Run 4: Throughput = 38.43 M operations per second, relative time: 0.026019 s.
Run 5: Throughput = 40.78 M operations per second, relative time: 0.024524 s.
```
---
## Conclusion
Phase 6-A successfully implements the intended optimization (disabling the SuperSlab lookup in release builds), but the measured performance impact (+1-4%) is **roughly an order of magnitude smaller** than the expected +12-15% based on the perf analysis.
**Critical Question**: Why does removing code that perf claims costs 15.84% CPU only yield 1-4% improvement?
**Most Likely Answer**: The compiler was already optimizing away the `hak_super_lookup()` call in release builds through dead code elimination, since its result is only used for debug assertions.
**Recommended Action**: Investigate before committing. If the compiler was already optimizing, Phase 6-A is still valuable for code clarity and future-proofing, but the performance claim needs correction.