### Changes:
- **Removed duplicate fast paths**: Disabled HAKMEM_TINY_FAST_PATH in:
  - malloc() entry point (line ~1257)
  - hak_alloc_at() helper (line ~682)
- **Unified to Box Theory**: All tiny allocations now use Box Theory's
  hak_tiny_alloc_fast_wrapper() at line ~712 (HAKMEM_TINY_PHASE6_BOX_REFACTOR)
### Rationale:
- The previous implementation ran **2 fast path checks** per allocation, one in malloc() and one in hak_alloc_at() (double overhead)
- Box Theory (tiny_alloc_fast.inc.h) provides an optimized 3-4 instruction path
- CLAUDE.md claims +64% (debug), +150% (production) with Box Theory
- Goal: eliminate the redundant checks and unify on a single fast path
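
For orientation, a pop from a thread-local singly linked freelist is the kind of path that can compile down to a handful of instructions. The sketch below is generic and is not the code in tiny_alloc_fast.inc.h; every name prefixed `sketch_` and the class count are assumptions made for illustration, and `__thread` assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NUM_CLASSES 8  /* hypothetical number of tiny size classes */

/* One freelist head per size class, private to each thread. */
static __thread void *sketch_tls_head[SKETCH_NUM_CLASSES];

/* Pop: load head, test for NULL, load next, store new head. */
static inline void *sketch_tls_pop(unsigned cls)
{
    void *p = sketch_tls_head[cls];
    if (__builtin_expect(p == NULL, 0))
        return NULL;                      /* miss: caller must refill */
    sketch_tls_head[cls] = *(void **)p;   /* next pointer is stored in the free block */
    return p;
}

/* Push a freed block back onto the list for its class. */
static inline void sketch_tls_push(unsigned cls, void *p)
{
    assert(p != NULL);
    *(void **)p = sketch_tls_head[cls];
    sketch_tls_head[cls] = p;
}
```

A miss returns NULL, so the cost of the refill that follows is not part of this path and has to be measured separately.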
### Performance Results:
⚠️ **REGRESSION** - Performance decreased:
```
Baseline (old tiny_fast_alloc): 1.68M ops/s
Box Theory (unified):           1.35M ops/s (-20%)
System malloc:                  8.08M ops/s (reference)
```
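
The percentage in the table is the relative change between the two throughput figures. A small helper, with the hypothetical name `relative_change_pct`, makes the arithmetic explicit:

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent. */
static double relative_change_pct(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

`relative_change_pct(1.68, 1.35)` evaluates to about -19.6, which the table rounds to -20%.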
### Status:
🔬 **EXPERIMENTAL** - This commit records the attempt; the unified path is slower than the baseline.
Possible causes:
1. Box Theory may need additional tuning (environment variables alone were not sufficient)
2. Refill backend may be slower than the old implementation
3. TLS freelist initialization overhead
4. Optimizations missing from the Box Theory integration
### Next Steps:
- Profile to identify why Box Theory is slower
- Compare refill efficiency: old vs Box Theory
- Check if TLS SLL variables are properly initialized
- Consider reverting if root cause not found
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md, CLAUDE.md Phase 6-1.7

Implementation: Move the HAKMEM_TINY_FAST_PATH check BEFORE all guard checks
in malloc(), inspired by the mimalloc/tcache entry point design.
Strategy:
- tcache has 0 branches before fast path
- mimalloc has 1-2 branches before fast path
- Old HAKMEM had 8+ branches before fast path
- Phase 1: Move the fast path to the top of malloc() and add branch prediction hints
Changes in core/hakmem.c:
1. Fast Path First: Size check → Init check → Cache hit (3 branches)
2. Slow Path: All guards moved after fast path (rare cases)
3. Branch hints: __builtin_expect() for hot paths
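
The three changes listed above can be sketched as follows. This is not the code in core/hakmem.c: the names prefixed `sketch_` are stand-ins, the 128-byte threshold is an assumption taken from the 8-128B benchmark range, the cache is reduced to a single thread-local list, and the slow path falls back to the system malloc() for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_LIKELY(x)   __builtin_expect(!!(x), 1)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Assumption: 128 bytes, taken from the 8-128B benchmark range. */
#define SKETCH_TINY_THRESHOLD 128

static int sketch_initialized;            /* stand-in for g_initialized */
static __thread void *sketch_cache_head;  /* one TLS list of 128-byte blocks */
static __thread int sketch_in_slow_path;  /* stand-in for the recursion guard */

/* Seed the cache with a block of at least SKETCH_TINY_THRESHOLD bytes. */
static void sketch_cache_put(void *block)
{
    assert(block != NULL);
    *(void **)block = sketch_cache_head;
    sketch_cache_head = block;
}

/* Stand-in for tiny_fast_alloc(): pop from the TLS cache, NULL on a miss. */
static inline void *sketch_tiny_fast_alloc(void)
{
    void *p = sketch_cache_head;
    if (p != NULL)
        sketch_cache_head = *(void **)p;
    return p;
}

/* Slow path: guard checks live here, off the hot path. */
static void *sketch_malloc_slow(size_t size)
{
    if (SKETCH_UNLIKELY(sketch_in_slow_path))
        return NULL;                      /* simplified recursion guard */
    sketch_in_slow_path = 1;
    void *p = malloc(size);               /* fall back to the system allocator */
    sketch_in_slow_path = 0;
    return p;
}

void *sketch_malloc(size_t size)
{
    /* Branches 1+2: allocator initialized and request is tiny. */
    if (SKETCH_LIKELY(sketch_initialized && size <= SKETCH_TINY_THRESHOLD)) {
        void *p = sketch_tiny_fast_alloc();
        /* Branch 3: cache hit. */
        if (SKETCH_LIKELY(p != NULL))
            return p;
    }
    return sketch_malloc_slow(size);
}
```

The hints only influence code layout and static prediction. They do not reduce the work done inside the cache lookup or the refill.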
Expected results (from research):
- ST: 0.46M → 1.4-2.3M ops/s (+204-400%)
- MT: 1.86M → 3.7-5.6M ops/s (+99-201%)
Actual results (Larson 2s 8-128B 1024):
- ST: 0.377M → 0.424M ops/s (+12% only)
- MT: 1.856M → 1.453M ops/s (-22% regression!)
Analysis:
- Similar pattern to previous Option A test (+42% ST, -20% MT)
- Entry point reordering alone is insufficient
- True bottleneck may be:
  1. tiny_fast_alloc() internals (size-to-class mapping, cache access)
  2. Refill cost (1,600 cycles for 16 individual calls, about 100 cycles each)
- Batch Refill optimization (Phase 3) should therefore take priority
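
The refill cost quoted above (1,600 cycles for 16 individual calls) is paid once per block. A batch refill obtains one contiguous chunk and links all blocks in a single pass. The sketch below shows only the linking step; `sketch_batch_link` is a hypothetical helper, and how HAKMEM's Phase 3 obtains the chunk is not described in this log.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Carve `count` blocks of `block_size` bytes out of one contiguous chunk
 * and link them into a singly linked list. Returns the list head.
 */
static void *sketch_batch_link(void *chunk, size_t block_size, size_t count)
{
    assert(chunk != NULL && count > 0);
    assert(block_size >= sizeof(void *));
    assert(block_size % sizeof(void *) == 0);  /* keeps next pointers aligned */

    char *base = chunk;
    for (size_t i = 0; i + 1 < count; i++)
        *(void **)(base + i * block_size) = base + (i + 1) * block_size;
    *(void **)(base + (count - 1) * block_size) = NULL;  /* terminate the list */
    return base;
}
```

With this shape the per-call overhead is paid once per batch, and the loop body is a single pointer store per block.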
Next steps:
- Investigate refill bottleneck with perf profiling
- Consider implementing Phase 3 (Batch Refill) before Phase 2
- A breakthrough may require combining several optimizations
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md

Changes:
- Reorder malloc() to prioritize Fast Path (initialized + tiny size check first)
- Move Fast Path check before all guard checks (recursion, LD_PRELOAD, etc.)
- Optimize free() with same strategy (initialized check first)
- Add branch prediction hints (__builtin_expect)
Implementation:
- malloc(): Fast Path now executes with 3 branches total
  - Branch 1+2: g_initialized && size <= TINY_FAST_THRESHOLD
  - Branch 3: tiny_fast_alloc() cache hit check
  - Slow Path: All guard checks moved after Fast Path miss
- free(): Fast Path with 1-2 branches
  - Branch 1: g_initialized check
  - Direct to hak_free_at() on normal case
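
The free() side described above can be sketched the same way. The names prefixed `demo_` are stand-ins, the real signature of hak_free_at() is not shown in this log, and the counters exist only to make the chosen path observable.

```c
#include <stddef.h>

static int demo_initialized;      /* stand-in for g_initialized */
static unsigned demo_fast_frees;  /* counts calls that took the fast path */
static unsigned demo_slow_frees;  /* counts calls that took the guarded path */

/* Stand-in for hak_free_at(); the real function releases the block. */
static void demo_free_at(void *ptr) { (void)ptr; demo_fast_frees++; }

/* Stand-in for the guarded slow path (recursion, LD_PRELOAD, etc.). */
static void demo_free_slow(void *ptr) { (void)ptr; demo_slow_frees++; }

void demo_free(void *ptr)
{
    /* Branch 1: the initialized check comes first and is hinted as likely. */
    if (__builtin_expect(demo_initialized, 1)) {
        demo_free_at(ptr);
        return;
    }
    demo_free_slow(ptr);
}
```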
Performance Results (Larson benchmark, size=8-128B):
Single-thread (threads=1):
- Before: 0.46M ops/s (10.7% of system malloc)
- After: 0.65M ops/s (15.4% of system malloc)
- Change: +42% improvement ✓
Multi-thread (threads=4):
- Before: 1.81M ops/s (25.0% of system malloc)
- After: 1.44M ops/s (19.9% of system malloc)
- Change: -20% regression ✗
Analysis:
- The ST improvement shows that the Fast Path reordering helps
- The MT regression suggests contention or cache issues
- Neither result met the target (+200-400%); further optimization is needed
Next Steps:
- Investigate MT regression (cache coherency?)
- Consider more aggressive inlining
- Explore Option B (Refill optimization)