### Changes:
1. **Removed diagnostic from wrapper** (hakmem_tiny.c:1542)
   - Was: `getenv()` + `fprintf()` check on every wrapper call
   - Now: returns `tiny_alloc_fast(size)` directly
   - Relies on LTO (`-flto`) to inline the wrapper
2. **Removed counter overhead from malloc()** (hakmem.c:1242)
   - Was: up to 4 TLS counter increments per malloc:
     - `g_malloc_total_calls++`
     - `g_malloc_tiny_size_match++`
     - `g_malloc_fast_path_tried++`
     - `g_malloc_fast_path_null++` (on miss only)
   - Now: zero counter overhead
### Performance Results:
```
Before (with overhead): 1.51M ops/s
After (zero overhead): 1.59M ops/s (+5% 🎉)
Baseline (old impl): 1.68M ops/s (-5% gap remains)
System malloc: 8.08M ops/s (reference)
```
### Analysis:
**What was costly:**
- Counter increments: up to 4 TLS writes per malloc (cache pollution)
- Diagnostic: `getenv()` on every wrapper call, even with logging disabled
- Together these cost ~80K ops/s (1.51M → 1.59M once removed)
**Remaining gap (-5% vs baseline):**
Box Theory (1.59M) vs old implementation (1.68M). Suspected causes (not yet verified):
- Ownership check in the free path
- Refill backend (`sll_refill_small_from_ss` vs `hak_tiny_alloc` x16)
### Bottleneck Update:
From profiling data (2,418 cycles per fast path):
```
Fast path time: 49.5M cycles (49.1% of total)
Refill time:    51.3M cycles (50.9% of total)
```
- Counter removal: ~5% improvement (measured above)
- LTO is expected to inline the wrapper; further gains not yet measured
### Status:
✅ IMPROVEMENT - Removed overhead, 5% faster
❌ STILL SHORT - 5% slower than baseline (1.68M target)
### Next Steps:
A. Investigate ownership check overhead in free path
B. Compare refill backend efficiency
C. Consider reverting to old implementation if gap persists
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md