Optimization:
=============
Add HAKMEM_BUILD_RELEASE check to trc_refill_guard_enabled():
- Release builds (NDEBUG defined): Always return 0 (no logging)
- Debug builds: Check HAKMEM_TINY_REFILL_FAILFAST env var
This eliminates fprintf() calls and getenv() overhead in release builds.
Benchmark Results:
==================
Before: 1,015,347 ops/s
After: 1,046,392 ops/s
→ +3.1% improvement! 🚀
Perf Analysis (before fix):
- buffered_vfprintf: 4.90% CPU (fprintf overhead)
- hak_tiny_free_superslab: 52.63% (main hotspot)
- superslab_refill: 14.53%
Note: NDEBUG is not currently defined in Makefile, so
HAKMEM_BUILD_RELEASE=0 by default. Real gains will be
higher with -DNDEBUG in production builds.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Optimization:
=============
Check remote_counts[slab_idx] BEFORE calling drain function.
If remote queue is empty (count == 0), skip the drain entirely.
Impact:
- Single-threaded: remote_count is ALWAYS 0 → drain calls = 0
- Multi-threaded: only drain when there are actual remote frees
- Reduces unnecessary function call overhead in common case
Code:
if (tls->ss && tls->slab_idx >= 0) {
uint32_t remote_count = atomic_load_explicit(
&tls->ss->remote_counts[tls->slab_idx], memory_order_relaxed);
if (remote_count > 0) {
_ss_remote_drain_to_freelist_unsafe(tls->ss, tls->slab_idx, meta);
}
}
Benchmark Results:
==================
bench_random_mixed (1 thread):
Before: 1,020,163 ops/s
After: 1,015,347 ops/s (-0.5%, within noise)
larson_hakmem (4 threads):
Before: 931,629 ops/s (1073 sec)
After: 929,709 ops/s (1075 sec) (-0.2%, within noise)
Note: Performance unchanged, but code is cleaner and avoids
unnecessary work in single-threaded case. Real bottleneck
appears to be elsewhere (Magazine layer overhead per CLAUDE.md).
Next: Profile with perf to find actual hotspots.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Root Cause:
===========
Freelist and remote queue contained the SAME blocks, causing use-after-free:
1. Thread A (owner): pops block X from freelist → allocates to user
2. User writes data ("ab") to block X
3. Thread B (remote): free(block X) → adds to remote queue
4. Thread A (later): drains remote queue → *(void**)block_X = chain_head
→ OVERWRITES USER DATA! 💥
The freelist pop path did NOT drain the remote queue first, so blocks could
be simultaneously in both freelist and remote queue.
Fix:
====
Add remote queue drain BEFORE freelist pop in refill path:
core/hakmem_tiny_refill_p0.inc.h:
- Call _ss_remote_drain_to_freelist_unsafe() BEFORE trc_pop_from_freelist()
- Add #include "superslab/superslab_inline.h"
- This ensures freelist and remote queue are mutually exclusive
Test Results:
=============
BEFORE:
larson_hakmem (4 threads): ❌ SEGV in seconds (freelist corruption)
AFTER:
larson_hakmem (4 threads): ✅ 931,629 ops/s (1073 sec stable run)
bench_random_mixed: ✅ 1,020,163 ops/s (no crashes)
Evidence:
- Fail-Fast logs showed next pointer corruption: 0x...6261 (ASCII "ab")
- Single-threaded benchmarks worked (865K ops/s)
- Multi-threaded Larson crashed immediately
- Fix eliminates all crashes in both benchmarks
Files:
- core/hakmem_tiny_refill_p0.inc.h: Add remote drain before freelist pop
- CURRENT_TASK.md: Document fix details
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Problem:
- tiny_remote_side_set() has fallback: writes to node memory if table full
- tiny_remote_side_get() had NO fallback: returns 0 when lookup fails
- This breaks remote queue drain chain traversal
- Remaining nodes stay in queue with sentinel 0xBADA55BADA55BADA
- Later allocations return corrupted nodes → SEGV
Changes:
- core/tiny_remote.c:598-606
- Added fallback to read from node memory when side table lookup fails
- Added sentinel check: return 0 if sentinel present (entry was evicted)
- Matches set() behavior at line 583
Result:
- Improved (but not complete fix)
- Freelist corruption still occurs
- Issue appears deeper than simple side table lookup failure
Next:
- SuperSlab refactoring needed (500+ lines in .h)
- Root cause investigation with ultrathink
Related commits:
- b8ed2b05b: Phase 6-2.6 (slab_data_start consistency)
- d2f0d8458: Phase 6-2.5 (constants + 2048 offset)
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
**Problem**: After previous fixes, 4T Larson success rate dropped 27% (4/15)
**Root Cause**:
In `log_superslab_oom_once()`, `g_hakmem_lock_depth++` was placed AFTER
`getrlimit()` call. However, the function was already called from within
malloc wrapper context where `g_hakmem_lock_depth = 1`.
When `getrlimit()` or other LIBC functions call `malloc()` internally,
they enter the wrapper with lock_depth=1, but the increment to 2 hasn't
happened yet, so getenv() in wrapper can trigger recursion.
**Fix**:
Move `g_hakmem_lock_depth++` to the VERY FIRST line after early return check.
This ensures ALL subsequent LIBC calls (getrlimit, fopen, fclose, fprintf)
bypass HAKMEM wrapper.
**Result**: 4T Larson success rate improved 27% → 70% (14/20 runs) ✅
+43% improvement, but 30% crash rate remains (continuing investigation)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Changed adopt loop from best-fit (scoring all 32 slabs) to first-fit
- Stop at first slab with freelist instead of scanning all 32
- Expected: -3,000 cycles per refill (eliminate 64 atomic loads + 32 scores)
Result: No measurable improvement (1.23M → 1.25M ops/s, ±0%)
Analysis:
- Adopt loop may not be executed frequently enough
- Larson benchmark hit rate might bypass adopt path
- Best-fit scoring overhead was smaller than estimated
Note: Fix#1 (getenv caching) was attempted but reverted due to -22% regression.
Global variable access overhead exceeded saved getenv() cost.
Profiling Results:
- Fast path: 143 cycles (10.4% of time) ✅ Good
- Refill: 19,624 cycles (89.6% of time) 🚨 Bottleneck!
Refill is 137x slower than fast path and dominates total cost.
Only happens 6.3% of the time but takes 90% of execution time.
Next: Optimize sll_refill_small_from_ss() backend.
### Changes:
1. **Removed diagnostic from wrapper** (hakmem_tiny.c:1542)
- Was: getenv() + fprintf() on every wrapper call
- Now: Direct return tiny_alloc_fast(size)
- Relies on LTO (-flto) for inlining
2. **Removed counter overhead from malloc()** (hakmem.c:1242)
- Was: 4 TLS counter increments per malloc
- g_malloc_total_calls++
- g_malloc_tiny_size_match++
- g_malloc_fast_path_tried++
- g_malloc_fast_path_null++ (on miss)
- Now: Zero counter overhead
### Performance Results:
```
Before (with overhead): 1.51M ops/s
After (zero overhead): 1.59M ops/s (+5% 🎉)
Baseline (old impl): 1.68M ops/s (-5% gap remains)
System malloc: 8.08M ops/s (reference)
```
### Analysis:
**What was heavy:**
- Counter increments: ~4 TLS writes per malloc (cache pollution)
- Diagnostic: getenv() + fprintf() check (even if disabled)
- These added ~80K ops/s overhead
**Remaining gap (-5% vs baseline):**
Box Theory (1.59M) vs Old implementation (1.68M)
- Likely due to: ownership check in free path
- Or: refill backend (sll_refill_small_from_ss vs hak_tiny_alloc x16)
### Bottleneck Update:
From profiling data (2,418 cycles per fast path):
```
Fast path time: 49.5M cycles (49.1% of total)
Refill time: 51.3M cycles (50.9% of total)
Counter overhead removed: ~5% improvement
LTO should inline wrapper: Further gains expected
```
### Status:
✅ IMPROVEMENT - Removed overhead, 5% faster
❌ STILL SHORT - 5% slower than baseline (1.68M target)
### Next Steps:
A. Investigate ownership check overhead in free path
B. Compare refill backend efficiency
C. Consider reverting to old implementation if gap persists
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
### Problem Identified:
Previous commit routed malloc() → guards → hak_alloc_at() → Box Theory
This added massive overhead (guard checks, function calls) defeating the
"3-4 instruction" fast path promise.
### Root Cause:
"命令数減って遅くなるのはおかしい" - User's insight was correct!
Box Theory claims 3-4 instructions, but routing added dozens of instructions
before reaching TLS freelist pop.
### Fix:
Move Box Theory call to malloc() entry point (line ~1253), BEFORE all guards:
```c
#ifdef HAKMEM_TINY_PHASE6_BOX_REFACTOR
if (size <= TINY_FAST_THRESHOLD) {
void* ptr = hak_tiny_alloc_fast_wrapper(size);
if (ptr) return ptr; // ✅ Fast path: No guards, no overhead
}
#endif
// SLOW PATH: All guards here...
```
### Performance Results:
```
Baseline (old tiny_fast_alloc): 1.68M ops/s
Box Theory (no env vars): 1.22M ops/s (-27%)
Box Theory (with env vars): 1.39M ops/s (-17%) ← Improved!
System malloc: 8.08M ops/s
CLAUDE.md expectation: 2.75M (+64%) ~ 4.19M (+150%) ← Not reached
```
### Env Vars Used:
```
HAKMEM_TINY_TLS_SLL=1 HAKMEM_TINY_TLS_LIST=0 HAKMEM_TINY_HOTMAG=0
HAKMEM_WRAP_TINY=1 HAKMEM_TINY_SS_ADOPT=1 HAKMEM_TINY_TRACE_RING=0
HAKMEM_SAFE_FREE=0 HAKMEM_TINY_REFILL_COUNT=128
```
### Verification:
- ✅ HAKMEM_TINY_PHASE6_BOX_REFACTOR=1 confirmed active
- ✅ hak_tiny_alloc_fast_wrapper() called (FRONT diagnostics)
- ✅ Routing now bypasses guards for fast path
- ❌ Still -17% slower than baseline (investigation needed)
### Status:
🔬 PARTIAL SUCCESS - Routing fixed, but performance below expectation.
Box Theory is active and bypassing guards, but still slower than old implementation.
### Next Steps:
- Compare refill implementations (old vs Box Theory)
- Profile to identify specific bottleneck
- Investigate why Box Theory underperforms vs CLAUDE.md claims
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md, CLAUDE.md Phase 6-1.7
### Changes:
- **Removed duplicate fast paths**: Disabled HAKMEM_TINY_FAST_PATH in:
- malloc() entry point (line ~1257)
- hak_alloc_at() helper (line ~682)
- **Unified to Box Theory**: All tiny allocations now use Box Theory's
hak_tiny_alloc_fast_wrapper() at line ~712 (HAKMEM_TINY_PHASE6_BOX_REFACTOR)
### Rationale:
- Previous implementation had **2 fast path checks** (double overhead)
- Box Theory (tiny_alloc_fast.inc.h) provides optimized 3-4 instruction path
- CLAUDE.md claims +64% (debug), +150% (production) with Box Theory
- Attempt to eliminate redundant checks and unify to single fast path
### Performance Results:
⚠️ **REGRESSION** - Performance decreased:
```
Baseline (old tiny_fast_alloc): 1.68M ops/s
Box Theory (unified): 1.35M ops/s (-20%)
System malloc: 8.08M ops/s (reference)
```
### Status:
🔬 **EXPERIMENTAL** - This commit documents the attempt but shows regression.
Possible issues:
1. Box Theory may need additional tuning (env vars not sufficient)
2. Refill backend may be slower than old implementation
3. TLS freelist initialization overhead
4. Missing optimizations in Box Theory integration
### Next Steps:
- Profile to identify why Box Theory is slower
- Compare refill efficiency: old vs Box Theory
- Check if TLS SLL variables are properly initialized
- Consider reverting if root cause not found
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md, CLAUDE.md Phase 6-1.7
Implementation:
Ultra-lightweight CPU cycle profiling using RDTSC instruction (~10 cycles overhead).
Changes:
1. Added rdtsc() inline function for x86_64 CPU cycle counter
2. Instrumented tiny_fast_alloc(), tiny_fast_free(), tiny_fast_refill()
3. Track malloc, free, refill, and migration cycles separately
4. Profile output via HAKMEM_TINY_PROFILE=1 environment variable
5. Renamed variables to avoid conflict with core/hakmem.c globals
Files modified:
- core/tiny_fastcache.h: rdtsc(), profile helpers, extern declarations
- core/tiny_fastcache.c: counter definitions, print_profile() output
Usage:
```bash
HAKMEM_TINY_PROFILE=1 ./larson_hakmem 2 8 128 1024 1 12345 4
```
Results (Larson 4 threads, 1.637M ops/s):
```
[MALLOC] count=20,480, avg_cycles=2,476
[REFILL] count=1,285, avg_cycles=38,412 ← 15.5x slower!
[FREE] (no data - not called via fast path)
```
Critical discoveries:
1. **REFILL is the bottleneck:**
- Average 38,412 cycles per refill (15.5x slower than malloc)
- Refill accounts for: 1,285 × 38,412 = 49.3M cycles
- Despite Phase 3 batch optimization, still extremely slow
- Calling hak_tiny_alloc() 16 times has massive overhead
2. **MALLOC is 24x slower than expected:**
- Average 2,476 cycles (expected ~100 cycles for tcache)
- Even cache hits are slow
- Profiling overhead is only ~10 cycles, so real cost is ~2,466 cycles
- Something fundamentally wrong with fast path
3. **Only 2.5% of allocations use fast path:**
- Total operations: 1.637M × 2s = 3.27M ops
- Tiny fast alloc: 20,480 × 4 threads = 81,920 ops
- Coverage: 81,920 / 3,270,000 = **2.5%**
- **97.5% of allocations bypass tiny_fast_alloc entirely!**
4. **FREE is not instrumented:**
- No free() calls captured by profiling
- hakmem.c's free() likely takes different path
- Not calling tiny_fast_free() at all
Root cause analysis:
The 4x performance gap (vs system malloc) is NOT due to:
- Entry point overhead (Phase 1) ❌
- Dual free lists (Phase 2) ❌
- Batch refill efficiency (Phase 3) ❌
The REAL problems:
1. **Tiny fast path is barely used** (2.5% coverage)
2. **Refill is catastrophically slow** (38K cycles)
3. **Even cache hits are 24x too slow** (2.5K cycles)
4. **Free path is completely bypassed**
Why system malloc is 4x faster:
- System tcache has ~100 cycle malloc
- System tcache has ~90% hit rate (vs our 2.5% usage)
- System malloc/free are symmetric (we only optimize malloc)
Next steps:
1. Investigate why 97.5% bypass tiny_fast_alloc
2. Profile the slow path (hak_alloc_at) that handles 97.5%
3. Understand why even cache hits take 2,476 cycles
4. Instrument free() path to see where frees go
5. May need to optimize slow path instead of fast path
This profiling reveals we've been optimizing the wrong thing.
The "fast path" is neither fast (2.5K cycles) nor used (2.5%).
Implementation:
- Register tiny_fast_print_stats() via atexit() on first refill
- Forward declaration for function ordering
- Enable with HAKMEM_TINY_FAST_STATS=1
Usage:
```bash
HAKMEM_TINY_FAST_STATS=1 ./larson_hakmem 2 8 128 1024 1 12345 4
```
Results (threads=4, Throughput=1.377M ops/s):
- refills = 1,285 per thread
- drains = 0 (cache never full)
- Total ops = 2.754M (2 seconds)
- Refill allocations = 20,560 (1,285 × 16)
- **Refill rate: 0.75%**
- **Cache hit rate: 99.25%** ✨
Analysis:
Contrary to expectations, refill cost is NOT the bottleneck:
- Current refill cost: 1,285 × 1,600 cycles = 2.056M cycles
- Even if batched (200 cycles): saves 1.799M cycles
- But refills are only 0.75% of operations!
True bottleneck must be:
1. Fast path itself (99.25% of allocations)
- malloc() overhead despite reordering
- size_to_class mapping (even LUT has cost)
- TLS cache access pattern
2. free() path (not optimized yet)
3. Cross-thread synchronization (22.8% cycles in profiling)
Key insight:
Phase 1 (entry point optimization) and Phase 3 (batch refill)
won't help much because:
- Entry point: Fast path already hit 99.25%
- Batch refill: Only affects 0.75% of operations
Next steps:
1. Add malloc/free counters to identify which is slower
2. Consider Phase 2 (Dual Free Lists) for locality
3. Investigate free() path optimization
4. May need to profile TLS cache access patterns
Related: mimalloc research shows dual free lists reduce cache
line bouncing - this may be more important than refill cost.
Implementation: Move HAKMEM_TINY_FAST_PATH check BEFORE all guard checks
in malloc(), inspired by mimalloc/tcache entry point design.
Strategy:
- tcache has 0 branches before fast path
- mimalloc has 1-2 branches before fast path
- Old HAKMEM had 8+ branches before fast path
- Phase 1: Move fast path to line 1, add branch prediction hints
Changes in core/hakmem.c:
1. Fast Path First: Size check → Init check → Cache hit (3 branches)
2. Slow Path: All guards moved after fast path (rare cases)
3. Branch hints: __builtin_expect() for hot paths
Expected results (from research):
- ST: 0.46M → 1.4-2.3M ops/s (+204-400%)
- MT: 1.86M → 3.7-5.6M ops/s (+99-201%)
Actual results (Larson 2s 8-128B 1024):
- ST: 0.377M → 0.424M ops/s (+12% only)
- MT: 1.856M → 1.453M ops/s (-22% regression!)
Analysis:
- Similar pattern to previous Option A test (+42% ST, -20% MT)
- Entry point reordering alone is insufficient
- True bottleneck may be:
1. tiny_fast_alloc() internals (size-to-class, cache access)
2. Refill cost (1,600 cycles for 16 individual calls)
3. Need Batch Refill optimization (Phase 3) as priority
Next steps:
- Investigate refill bottleneck with perf profiling
- Consider implementing Phase 3 (Batch Refill) before Phase 2
- May need combination of multiple optimizations for breakthrough
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
Changes:
- Reorder malloc() to prioritize Fast Path (initialized + tiny size check first)
- Move Fast Path check before all guard checks (recursion, LD_PRELOAD, etc.)
- Optimize free() with same strategy (initialized check first)
- Add branch prediction hints (__builtin_expect)
Implementation:
- malloc(): Fast Path now executes with 3 branches total
- Branch 1+2: g_initialized && size <= TINY_FAST_THRESHOLD
- Branch 3: tiny_fast_alloc() cache hit check
- Slow Path: All guard checks moved after Fast Path miss
- free(): Fast Path with 1-2 branches
- Branch 1: g_initialized check
- Direct to hak_free_at() on normal case
Performance Results (Larson benchmark, size=8-128B):
Single-thread (threads=1):
- Before: 0.46M ops/s (10.7% of system malloc)
- After: 0.65M ops/s (15.4% of system malloc)
- Change: +42% improvement ✓
Multi-thread (threads=4):
- Before: 1.81M ops/s (25.0% of system malloc)
- After: 1.44M ops/s (19.9% of system malloc)
- Change: -20% regression ✗
Analysis:
- ST improvement shows Fast Path optimization works
- MT regression suggests contention or cache issues
- Did not meet target (+200-400%), further optimization needed
Next Steps:
- Investigate MT regression (cache coherency?)
- Consider more aggressive inlining
- Explore Option B (Refill optimization)