### Changes:
1. **Removed diagnostic from wrapper** (hakmem_tiny.c:1542)
   - Was: getenv() + fprintf() on every wrapper call
   - Now: direct return of tiny_alloc_fast(size)
   - Relies on LTO (-flto) to inline the wrapper
2. **Removed counter overhead from malloc()** (hakmem.c:1242)
   - Was: 4 TLS counter increments per malloc
     - g_malloc_total_calls++
     - g_malloc_tiny_size_match++
     - g_malloc_fast_path_tried++
     - g_malloc_fast_path_null++ (on miss)
   - Now: zero counter overhead
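The wrapper change can be pictured with the minimal sketch below. It is not the actual hakmem_tiny.c code: tiny_alloc_fast() is stubbed with the system allocator, and the environment variable name HAKMEM_DIAG_EXAMPLE is hypothetical, used only to show where the per-call cost came from.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the real fast path (assumption: forwards to the system allocator). */
static inline void *tiny_alloc_fast(size_t size) {
    return malloc(size);
}

/* "Was": an environment lookup on every call, even when the diagnostic is off. */
void *wrapper_with_diagnostic(size_t size) {
    const char *diag = getenv("HAKMEM_DIAG_EXAMPLE"); /* hypothetical variable name */
    if (diag != NULL && diag[0] == '1') {
        fprintf(stderr, "[wrapper] size=%zu\n", size);
    }
    return tiny_alloc_fast(size);
}

/* "Now": a direct return that the compiler (or LTO across files) can inline. */
void *wrapper_direct(size_t size) {
    return tiny_alloc_fast(size);
}
```

Both functions return the same pointer type and can be freed with free(); the only difference is the work done before reaching the fast path.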
### Performance Results:
```
Before (with overhead): 1.51M ops/s
After (zero overhead): 1.59M ops/s (+5% 🎉)
Baseline (old impl): 1.68M ops/s (-5% gap remains)
System malloc: 8.08M ops/s (reference)
```
### Analysis:
**What was heavy:**
- Counter increments: ~4 TLS writes per malloc (cache pollution)
- Diagnostic: getenv() + fprintf() check (even if disabled)
- Together these cost ~80K ops/s (1.51M → 1.59M ops/s after removal)
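If the counters are needed again for debugging, one option (not what this commit does, which removes them outright) is to gate them behind a build flag so release builds pay nothing. A minimal sketch, assuming a hypothetical flag HAKMEM_SKETCH_COUNTERS and a stand-in allocation function:

```c
#include <stddef.h>
#include <stdlib.h>

/* HAKMEM_SKETCH_COUNTERS is a hypothetical build flag, not an existing HAKMEM option. */
#ifdef HAKMEM_SKETCH_COUNTERS
static _Thread_local unsigned long g_malloc_total_calls;
static _Thread_local unsigned long g_malloc_fast_path_tried;
#define COUNT(counter) ((void)((counter)++))
#else
/* Expands to nothing useful, so no TLS write is emitted on the hot path. */
#define COUNT(counter) ((void)0)
#endif

/* Stand-in for the malloc() entry point (assumption: forwards to the system allocator). */
void *counted_alloc(size_t size) {
    COUNT(g_malloc_total_calls);
    COUNT(g_malloc_fast_path_tried);
    return malloc(size);
}
```

Building with -DHAKMEM_SKETCH_COUNTERS restores the increments; without it the counter names never reach the compiler.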
**Remaining gap (-5% vs baseline):**
Box Theory (1.59M ops/s) vs old implementation (1.68M ops/s). Candidate causes:
- Ownership check in free path
- Refill backend (sll_refill_small_from_ss vs hak_tiny_alloc x16)
### Bottleneck Update:
From profiling data (2,418 cycles per fast path):
```
Fast path time: 49.5M cycles (49.1% of total)
Refill time: 51.3M cycles (50.9% of total)
Counter overhead removed: ~5% improvement
LTO should inline wrapper: Further gains expected
```
### Status:
✅ IMPROVEMENT - Removed overhead, 5% faster
❌ STILL SHORT - 5% slower than baseline (1.68M target)
### Next Steps:
A. Investigate ownership check overhead in free path
B. Compare refill backend efficiency
C. Consider reverting to old implementation if gap persists
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
### Problem Identified:
Previous commit routed malloc() → guards → hak_alloc_at() → Box Theory
This added massive overhead (guard checks, function calls) defeating the
"3-4 instruction" fast path promise.
### Root Cause:
"It makes no sense for it to get slower when the instruction count goes down" - User's insight was correct!
Box Theory claims 3-4 instructions, but routing added dozens of instructions
before reaching TLS freelist pop.
### Fix:
Move Box Theory call to malloc() entry point (line ~1253), BEFORE all guards:
```c
#ifdef HAKMEM_TINY_PHASE6_BOX_REFACTOR
if (size <= TINY_FAST_THRESHOLD) {
void* ptr = hak_tiny_alloc_fast_wrapper(size);
if (ptr) return ptr; // ✅ Fast path: No guards, no overhead
}
#endif
// SLOW PATH: All guards here...
```
### Performance Results:
```
Baseline (old tiny_fast_alloc): 1.68M ops/s
Box Theory (no env vars): 1.22M ops/s (-27%)
Box Theory (with env vars): 1.39M ops/s (-17%) ← Improved!
System malloc: 8.08M ops/s
CLAUDE.md expectation: 2.75M (+64%) ~ 4.19M (+150%) ← Not reached
```
### Env Vars Used:
```
HAKMEM_TINY_TLS_SLL=1 HAKMEM_TINY_TLS_LIST=0 HAKMEM_TINY_HOTMAG=0
HAKMEM_WRAP_TINY=1 HAKMEM_TINY_SS_ADOPT=1 HAKMEM_TINY_TRACE_RING=0
HAKMEM_SAFE_FREE=0 HAKMEM_TINY_REFILL_COUNT=128
```
### Verification:
- ✅ HAKMEM_TINY_PHASE6_BOX_REFACTOR=1 confirmed active
- ✅ hak_tiny_alloc_fast_wrapper() called (FRONT diagnostics)
- ✅ Routing now bypasses guards for fast path
- ❌ Still -17% slower than baseline (investigation needed)
### Status:
🔬 PARTIAL SUCCESS - Routing fixed, but performance below expectation.
Box Theory is active and bypassing guards, but still slower than old implementation.
### Next Steps:
- Compare refill implementations (old vs Box Theory)
- Profile to identify specific bottleneck
- Investigate why Box Theory underperforms vs CLAUDE.md claims
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md, CLAUDE.md Phase 6-1.7
### Changes:
- **Removed duplicate fast paths**: Disabled HAKMEM_TINY_FAST_PATH in:
  - malloc() entry point (line ~1257)
  - hak_alloc_at() helper (line ~682)
- **Unified to Box Theory**: All tiny allocations now use Box Theory's
  hak_tiny_alloc_fast_wrapper() at line ~712 (HAKMEM_TINY_PHASE6_BOX_REFACTOR)
### Rationale:
- Previous implementation had **2 fast path checks** (double overhead)
- Box Theory (tiny_alloc_fast.inc.h) provides optimized 3-4 instruction path
- CLAUDE.md claims +64% (debug), +150% (production) with Box Theory
- Attempt to eliminate redundant checks and unify to single fast path
### Performance Results:
⚠️ **REGRESSION** - Performance decreased:
```
Baseline (old tiny_fast_alloc): 1.68M ops/s
Box Theory (unified): 1.35M ops/s (-20%)
System malloc: 8.08M ops/s (reference)
```
### Status:
🔬 **EXPERIMENTAL** - This commit documents the attempt but shows regression.
Possible issues:
1. Box Theory may need additional tuning (env vars not sufficient)
2. Refill backend may be slower than old implementation
3. TLS freelist initialization overhead
4. Missing optimizations in Box Theory integration
### Next Steps:
- Profile to identify why Box Theory is slower
- Compare refill efficiency: old vs Box Theory
- Check if TLS SLL variables are properly initialized
- Consider reverting if root cause not found
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md, CLAUDE.md Phase 6-1.7
Implementation: Move HAKMEM_TINY_FAST_PATH check BEFORE all guard checks
in malloc(), inspired by mimalloc/tcache entry point design.
Strategy:
- tcache has 0 branches before fast path
- mimalloc has 1-2 branches before fast path
- Old HAKMEM had 8+ branches before fast path
- Phase 1: Move fast path to line 1, add branch prediction hints
Changes in core/hakmem.c:
1. Fast Path First: Size check → Init check → Cache hit (3 branches)
2. Slow Path: All guards moved after fast path (rare cases)
3. Branch hints: __builtin_expect() for hot paths
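The three changes listed above can be sketched roughly as follows. This is an illustration, not the code in core/hakmem.c: the threshold value of 128, the four-slot stub cache, slow_path_alloc() and hak_malloc_sketch() are all assumptions made for the example. __builtin_expect() is the GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdlib.h>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

#define TINY_FAST_THRESHOLD 128   /* assumed value for this sketch */

static int g_initialized = 1;     /* assumed to be set by allocator init */
static int g_slow_path_calls = 0; /* instrumentation for the example only */

/* Stub cache: hands out four static slots, then reports a miss with NULL. */
static unsigned char g_slots[4][TINY_FAST_THRESHOLD] __attribute__((aligned(16)));
static int g_next_slot = 0;

static void *tiny_fast_alloc(size_t size) {
    (void)size;
    if (g_next_slot < 4) {
        return g_slots[g_next_slot++];
    }
    return NULL;
}

/* Hypothetical slow path: the guard checks would live here. */
static void *slow_path_alloc(size_t size) {
    g_slow_path_calls++;
    return malloc(size);
}

void *hak_malloc_sketch(size_t size) {
    /* Fast path first: init check and size check, then cache hit check. */
    if (LIKELY(g_initialized && size <= TINY_FAST_THRESHOLD)) {
        void *ptr = tiny_fast_alloc(size);
        if (LIKELY(ptr != NULL)) {
            return ptr;
        }
    }
    /* Rare cases: everything else runs only after a fast-path miss. */
    return slow_path_alloc(size);
}
```

A 64-byte request is served from the stub cache without touching the slow path, while a 4096-byte request skips the cache and goes straight to slow_path_alloc().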
Expected results (from research):
- ST: 0.46M → 1.4-2.3M ops/s (+204-400%)
- MT: 1.86M → 3.7-5.6M ops/s (+99-201%)
Actual results (Larson 2s 8-128B 1024):
- ST: 0.377M → 0.424M ops/s (+12% only)
- MT: 1.856M → 1.453M ops/s (-22% regression!)
Analysis:
- Similar pattern to previous Option A test (+42% ST, -20% MT)
- Entry point reordering alone is insufficient
- True bottleneck may be:
  1. tiny_fast_alloc() internals (size-to-class, cache access)
  2. Refill cost (1,600 cycles for 16 individual calls, i.e. ~100 cycles per call)
- If refill dominates, Batch Refill optimization (Phase 3) should take priority
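To make the refill comparison concrete, the sketch below contrasts 16 individual backend calls with a single batched call that carves one chunk into 16 blocks. It is only an illustration of the idea behind Batch Refill: the backend functions, the freelist layout, and the call counter are hypothetical and do not reflect the actual HAKMEM refill code. block_size is assumed to be at least sizeof(void *) and a multiple of pointer alignment.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct FreeNode {
    struct FreeNode *next;
} FreeNode;

static _Thread_local FreeNode *tls_freelist = NULL; /* per-thread singly linked list */
static int g_backend_calls = 0;                     /* instrumentation for the example */

/* Hypothetical backend: one call yields one block. */
static void *backend_alloc_one(size_t block_size) {
    g_backend_calls++;
    return malloc(block_size);
}

/* Hypothetical backend: one call yields a contiguous chunk for `count` blocks. */
static void *backend_alloc_chunk(size_t block_size, size_t count) {
    g_backend_calls++;
    return malloc(block_size * count);
}

/* Individual refill: `count` backend calls, one push per call. */
size_t refill_individual(size_t block_size, size_t count) {
    size_t added = 0;
    for (size_t i = 0; i < count; i++) {
        FreeNode *node = (FreeNode *)backend_alloc_one(block_size);
        if (node == NULL) break;
        node->next = tls_freelist;
        tls_freelist = node;
        added++;
    }
    return added;
}

/* Batch refill: one backend call, then carve the chunk into blocks locally. */
size_t refill_batch(size_t block_size, size_t count) {
    unsigned char *chunk = (unsigned char *)backend_alloc_chunk(block_size, count);
    if (chunk == NULL) return 0;
    for (size_t i = 0; i < count; i++) {
        FreeNode *node = (FreeNode *)(chunk + i * block_size);
        node->next = tls_freelist;
        tls_freelist = node;
    }
    return count;
}
```

With count = 16, the individual version makes 16 backend calls and the batched version makes 1; whether that translates into the hoped-for speedup depends on how expensive each real backend call is.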
Next steps:
- Investigate refill bottleneck with perf profiling
- Consider implementing Phase 3 (Batch Refill) before Phase 2
- May need combination of multiple optimizations for breakthrough
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
Changes:
- Reorder malloc() to prioritize Fast Path (initialized + tiny size check first)
- Move Fast Path check before all guard checks (recursion, LD_PRELOAD, etc.)
- Optimize free() with same strategy (initialized check first)
- Add branch prediction hints (__builtin_expect)
Implementation:
- malloc(): Fast Path now executes with 3 branches total
  - Branch 1+2: g_initialized && size <= TINY_FAST_THRESHOLD
  - Branch 3: tiny_fast_alloc() cache hit check
  - Slow Path: All guard checks moved after Fast Path miss
- free(): Fast Path with 1-2 branches
  - Branch 1: g_initialized check
  - Direct to hak_free_at() on normal case
Performance Results (Larson benchmark, size=8-128B):
Single-thread (threads=1):
- Before: 0.46M ops/s (10.7% of system malloc)
- After: 0.65M ops/s (15.4% of system malloc)
- Change: +42% improvement ✓
Multi-thread (threads=4):
- Before: 1.81M ops/s (25.0% of system malloc)
- After: 1.44M ops/s (19.9% of system malloc)
- Change: -20% regression ✗
Analysis:
- ST improvement shows Fast Path optimization works
- MT regression suggests contention or cache issues
- Did not meet target (+200-400%), further optimization needed
Next Steps:
- Investigate MT regression (cache coherency?)
- Consider more aggressive inlining
- Explore Option B (Refill optimization)