Second freelist path identified by Task exploration agent:
- tiny_drain_freelist_to_sll_once() in hakmem_tiny_free.inc
- Activated via HAKMEM_TINY_DRAIN_TO_SLL environment variable
- Pops blocks from the freelist without restoring their headers
  before the tls_sll_push() call
Fix applied:
1. Added HEADER_MAGIC restoration before tls_sll_push()
in tiny_drain_freelist_to_sll_once() (lines 74-79)
2. Added tiny_region_id.h include for HEADER_MAGIC definition
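A minimal sketch of the fix shape, assuming a 1-byte header at the block base
that encodes HEADER_MAGIC plus the class index (freelist_pop(), the fl handle,
and the tls_sll_push() argument order are illustrative):

```c
#include <stdint.h>
#include "tiny_region_id.h"   /* HEADER_MAGIC */

/* Drain loop after the fix: freelist linkage clobbers the header byte,
 * so restore it before every push to the TLS SLL. */
static void drain_to_sll(void *fl, unsigned class_idx) {
    void *blk;
    while ((blk = freelist_pop(fl)) != NULL) {                     /* hypothetical pop */
        ((uint8_t *)blk)[0] = (uint8_t)(HEADER_MAGIC | class_idx); /* restore header */
        tls_sll_push(class_idx, blk);                              /* now sees a valid header */
    }
}
```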
This completes the header restoration fixes for all known
freelist → TLS SLL code paths:
1. box_carve_and_push_with_freelist() ✓ (commit 3c6c76cb1)
2. tiny_drain_freelist_to_sll_once() ✓ (this commit)
Expected result:
- Eliminates remaining 4-thread header corruption error
- All freelist blocks now have valid headers before TLS SLL push
Note: Encountered segfault in larson_hakmem during testing,
but this appears to be a pre-existing issue unrelated to
header restoration fixes (verified by testing without changes).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
1. Archive unused backend files (ss_legacy_backend_box and ss_unified_backend_box, .c/.h)
- These files were not linked in the build
- Moved to archive/ to reduce confusion
2. Created HAK_RET_ALLOC_BLOCK macro for SuperSlab allocations
- Replaces superslab_return_block() function
- Consistent with existing HAK_RET_ALLOC pattern
- Single source of truth for header writing
- Defined in hakmem_tiny_superslab_internal.h
3. Added header validation on TLS SLL push
- Detects blocks pushed without proper header
- Enabled via HAKMEM_TINY_SLL_VALIDATE_HDR=1 (release)
- Always on in debug builds
- Logs first 10 violations with backtraces
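A sketch of both pieces; the macro body, header_is_valid(), and the logger are
assumptions beyond what this log states (the user pointer = base + 1 layout is
taken from the header-skip commit elsewhere in this log):

```c
/* Single source of truth for header writing; expand only in functions
 * returning void*. */
#define HAK_RET_ALLOC_BLOCK(base, cls)                           \
    do {                                                         \
        tiny_region_id_write_header((uint8_t *)(base), (cls));   \
        return (void *)((uint8_t *)(base) + 1);                  \
    } while (0)

/* Push-time validation: env-gated in release, always on in debug. */
static inline void tls_sll_validate_hdr(void *base, unsigned cls) {
    static _Atomic int violations;
    if (g_validate_hdr && !header_is_valid(base, cls))     /* assumed helpers */
        if (atomic_fetch_add(&violations, 1) < 10)
            log_violation_with_backtrace(base, cls);       /* hypothetical logger */
}
```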
Benefits:
- Easier to track allocation paths
- Catches header bugs at push time
- More maintainable macro-based design
Note: Larson bug still reproduces - header corruption occurs
before push validation can catch it.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Major refactoring to improve maintainability and debugging:
1. Split hakmem_tiny_superslab.c (1521 lines) into 7 focused files:
- superslab_allocate.c: SuperSlab allocation/deallocation
- superslab_backend.c: Backend allocation paths (legacy, shared)
- superslab_ace.c: ACE (Adaptive Cache Engine) logic
- superslab_slab.c: Slab initialization and bitmap management
- superslab_cache.c: LRU cache and prewarm cache management
- superslab_head.c: SuperSlabHead management and expansion
- superslab_stats.c: Statistics tracking and debugging
2. Created hakmem_tiny_superslab_internal.h for shared declarations
3. Added superslab_return_block() as single exit point for header writing:
- All backend allocations now go through this helper
- Prevents bugs where headers are forgotten in some paths
- Makes future debugging easier
4. Updated Makefile for new file structure
5. Added header writing to ss_legacy_backend_box.c and
ss_unified_backend_box.c (though not currently linked)
Note: Header corruption bug in Larson benchmark still exists.
Class 1-6 allocations go through TLS refill/carve paths, not backend.
Further investigation needed.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Changes:
- smallmid_is_in_range(): Add __attribute__((always_inline))
- mid_is_in_range(): Add __attribute__((always_inline))
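Shape of the change (the attribute is the point; the range-check body is an
illustrative stand-in):

```c
#include <stdint.h>

static inline __attribute__((always_inline))
int smallmid_is_in_range(const void *p) {
    extern uintptr_t g_smallmid_base, g_smallmid_end;   /* assumed globals */
    return (uintptr_t)p >= g_smallmid_base && (uintptr_t)p < g_smallmid_end;
}
```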
Expected: Reduce function call overhead in Front Gate routing
Result: Neutral performance (~72M ops/s, same as Phase 1 final)
Analysis:
- Compiler was already inlining these simple functions with -O3 -flto
- 36M branches identified by perf are NOT from Front Gate routing
- Most branches are inside allocators (tiny_alloc, free, etc.)
- Front Gate optimization had minimal impact, as predicted
Next: SuperSlab size optimization (clear 3-5% benefit expected)
Files:
- core/hakmem_smallmid.h:116-119
- core/hakmem_mid_mt.h:228-231
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Problem: A large volume of debug logging in release builds was causing
performance degradation in benchmarks (ChatGPT reported 0.73M ops/s vs the
expected 70M+).
Solution: Guard the Ultra SLIM gate debug log with #if !HAKMEM_BUILD_RELEASE.
The log printed once per thread, which is acceptable in debug builds but
should be silent in production.
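The guard pattern applied (message text illustrative):

```c
#if !HAKMEM_BUILD_RELEASE
    fprintf(stderr, "[hakmem] Ultra SLIM gate enabled\n");
#endif
/* In release builds the call is compiled out entirely: no branch, no I/O. */
```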
Performance impact: Logs now suppressed in release builds, reducing I/O
overhead during benchmarks.
Refactored the extremely compressed line 312 (previously 600+ chars) into
properly indented, readable code while preserving identical logic:
- Broke down TLS local freelist spill operation into clear steps
- Added clarifying comment for spill operation
- Improved atomic CAS loop formatting
- No functional changes, only formatting improvements
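A sketch of the spill shape after reformatting, assuming a lock-free singly
linked global freelist (g_freelist_head and the function name are illustrative):

```c
#include <stdatomic.h>

static _Atomic(void *) g_freelist_head;   /* stand-in for the shared list head */

/* Splice the TLS-local chain [head .. tail] onto the shared freelist.
 * On CAS failure old is reloaded, so the tail link is rewritten on each
 * retry; a push-only splice is ABA-safe since old is never dereferenced. */
static void tls_spill_chain(void *head, void *tail) {
    void *old = atomic_load_explicit(&g_freelist_head, memory_order_relaxed);
    do {
        *(void **)tail = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &g_freelist_head, &old, head,
                 memory_order_release, memory_order_relaxed));
}
```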
Performance verified: 16-18M ops/s maintained (same as before)
Replace hardcoded values with named constants for better maintainability:
- ELO_MAX_CPU_NS = 100000.0 (100 microseconds)
- ELO_MAX_PAGE_FAULTS = 1000.0
- ELO_MAX_BYTES_LIVE = 100000000.0 (100 MB)
These constants define the normalization range for ELO score computation.
Moving them to file scope makes them easier to tune and document.
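The constants and the normalization they bound; the clamp is an assumption
about how out-of-range samples are treated:

```c
static const double ELO_MAX_CPU_NS      = 100000.0;      /* 100 microseconds */
static const double ELO_MAX_PAGE_FAULTS = 1000.0;
static const double ELO_MAX_BYTES_LIVE  = 100000000.0;   /* 100 MB */

static double elo_normalize(double value, double max) {
    double x = value / max;
    return x < 0.0 ? 0.0 : (x > 1.0 ? 1.0 : x);   /* clamp to [0, 1] */
}
```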
Performance: No change (70.1M ops/s average)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Remove obsolete comment line referencing deleted Phase E5 code.
The actual code was already removed in the 2025-11-27 cleanup.
Performance: No change (69.7M ops/s)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add unified stats/dump control that allows enabling specific stats
modules using comma-separated values or "all" to enable everything.
New file: core/hakmem_stats_master.h
- HAKMEM_STATS=all: Enable all stats modules
- HAKMEM_STATS=sfc,fast,pool: Enable specific modules
- HAKMEM_STATS_DUMP=1: Dump stats at exit
- hak_stats_check(): Check if module should enable stats
Available stats modules:
sfc, fast, heap, refill, counters, ring, invariant,
pagefault, front, pool, slim, guard, nearempty
Updated files:
- core/hakmem_tiny_sfc.c: Use hak_stats_check() for SFC stats
- core/hakmem_shared_pool.c: Use hak_stats_check() for pool stats
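A plausible shape for hak_stats_check(); result caching and exact matching
semantics are assumptions:

```c
#include <stdlib.h>
#include <string.h>

static int hak_stats_check(const char *module) {
    const char *v = getenv("HAKMEM_STATS");
    if (!v) return 0;
    if (strcmp(v, "all") == 0) return 1;
    size_t n = strlen(module);
    for (const char *p = v; (p = strstr(p, module)) != NULL; p += n) {
        int starts = (p == v) || (p[-1] == ',');      /* token boundary before */
        int ends   = (p[n] == '\0') || (p[n] == ','); /* token boundary after  */
        if (starts && ends) return 1;
    }
    return 0;
}
```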
Performance: No regression (72.9M ops/s)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add unified trace control that allows enabling specific trace modules
using comma-separated values or "all" to enable everything.
New file: core/hakmem_trace_master.h
- HAKMEM_TRACE=all: Enable all trace modules
- HAKMEM_TRACE=ptr,refill,free,mailbox: Enable specific modules
- HAKMEM_TRACE_LEVEL=N: Set trace verbosity (1-3)
- hak_trace_check(): Check if module should enable tracing
Available trace modules:
ptr, refill, superslab, ring, free, mailbox, registry
Priority order:
1. HAKMEM_QUIET=1 → suppress all
2. Specific module ENV (e.g., HAKMEM_PTR_TRACE=1)
3. HAKMEM_TRACE=module1,module2
4. Default → disabled
Updated files:
- core/tiny_refill.h: Use hak_trace_check() for refill tracing
- core/box/mailbox_box.c: Use hak_trace_check() for mailbox tracing
Performance: No regression (72.9M ops/s)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add centralized debug control system that allows enabling all debug
modules at once, while maintaining backwards compatibility with
individual module ENVs.
New file: core/hakmem_debug_master.h
- HAKMEM_DEBUG_ALL=1: Enable all debug modules
- HAKMEM_DEBUG_LEVEL=N: Set debug level (0=off, 1=critical, 2=normal, 3=verbose)
- HAKMEM_QUIET=1: Suppress all debug (highest priority)
- hak_debug_check(): Check if module should enable debug
- hak_is_quiet(): Quick check for quiet mode
Priority order:
1. HAKMEM_QUIET=1 → suppress all
2. Specific module ENV (e.g., HAKMEM_SFC_DEBUG=1)
3. HAKMEM_DEBUG_ALL=1
4. HAKMEM_DEBUG_LEVEL >= threshold
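The priority chain as code (sketch; env_is_1() is a hypothetical helper, and
the level threshold is shown as a parameter):

```c
#include <stdlib.h>

static int env_is_1(const char *name) {
    const char *v = getenv(name);
    return v && v[0] == '1';
}

static int hak_debug_check(const char *module_env, int level_threshold) {
    if (env_is_1("HAKMEM_QUIET"))     return 0;   /* 1. quiet wins       */
    if (env_is_1(module_env))         return 1;   /* 2. per-module ENV   */
    if (env_is_1("HAKMEM_DEBUG_ALL")) return 1;   /* 3. global switch    */
    const char *lvl = getenv("HAKMEM_DEBUG_LEVEL");
    return lvl && atoi(lvl) >= level_threshold;   /* 4. level threshold  */
}
```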
Updated files:
- core/hakmem_elo.c: Use hak_is_quiet() instead of local implementation
- core/hakmem_shared_pool.c: Use hak_debug_check() for lock stats
Performance: No regression (71.5M ops/s maintained)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Critical hot path fix: hakmem_elo.c was calling getenv("HAKMEM_QUIET")
10+ times inside loops, causing 50-100μs overhead per iteration.
Fix: Cache the flag in a static variable with lazy initialization.
- Added is_quiet() helper function with __builtin_expect optimization
- Replaced all 10 inline getenv() calls with is_quiet()
- First call initializes, subsequent calls are just a branch
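The pattern, as described (the race on first call is benign because the
initialization is idempotent):

```c
#include <stdlib.h>

static inline int is_quiet(void) {
    static int cached = -1;                        /* -1 = not yet read */
    if (__builtin_expect(cached < 0, 0))           /* cold path, once   */
        cached = (getenv("HAKMEM_QUIET") != NULL);
    return cached;                                 /* hot path: one branch */
}
```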
This is part of the ENV variable cleanup effort identified by the survey:
- Total ENV variables: 228 (target: ~80)
- getenv() calls in hot paths: CRITICAL issue
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Problem: Class 0 (8B stride) was using offset 1 for next pointer storage,
but 8B stride cannot fit [1B header][8B next pointer] - it overflows by 1 byte
into the adjacent block.
Fix: Use offset 0 for C0 (same as C7), allowing the header to be overwritten.
This is safe because:
1. class_map provides out-of-band class_idx lookup (header not needed for free)
2. P3 skips header write by default (header byte is unused anyway)
Optimization: Replace branching with bitmask lookup for zero-cost abstraction.
- Old: (class_idx == 0 || class_idx == 7) ? 0u : 1u (branch)
- New: (0x7Eu >> class_idx) & 1u (branchless)
Bit pattern: C0=0, C1-C6=1, C7=0 → 0b01111110 = 0x7E
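The lookup as code (function name is illustrative; the expression is the one
above):

```c
/* Next-pointer offset for class i is bit i of 0x7E. */
static inline unsigned tiny_nextptr_offset(unsigned class_idx) {
    return (0x7Eu >> class_idx) & 1u;   /* C0=0, C1..C6=1, C7=0 */
}
```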
Performance results:
- 8B: 85.19M → 85.61M (+0.5%)
- 16B: 137.43M → 147.31M (+7.2%)
- 64B: 84.21M → 84.90M (+0.8%)
Thanks to ChatGPT for spotting the g_tiny_class_sizes vs tiny_nextptr.h mismatch!
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Skip the 1-byte header write in tiny_region_id_write_header() when class_map
is active (default). class_map provides out-of-band class_idx lookup, making
the header byte unnecessary for the free path.
Changes:
- Add ENV-gated conditional to skip header write (default: skip)
- ENV: HAKMEM_TINY_WRITE_HEADER=1 to force header write (legacy mode)
- Memory layout preserved: user pointer = base + 1 (1B unused when skipped)
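A sketch of the gate; the cached flag, signature, and header encoding are
assumptions:

```c
#include <stdint.h>

extern int g_write_header;   /* cached HAKMEM_TINY_WRITE_HEADER (assumed) */

static inline void tiny_region_id_write_header(uint8_t *base, unsigned cls) {
    if (__builtin_expect(g_write_header, 0))       /* legacy mode only */
        base[0] = (uint8_t)(HEADER_MAGIC | cls);   /* assumed encoding */
    /* default: the byte at base stays unused; class_map supplies cls */
}
```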
Performance improvement:
- tiny_hot 64B: 83.5M → 84.2M ops/sec (+0.8%)
- random_mixed ws=256: 68.1M → 72.2M ops/sec (+6%)
The header skip saves one store instruction per allocation, which is
particularly beneficial for mixed-size workloads like random_mixed.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add active field to TinySlabMeta to track blocks currently held by
users (not in TLS SLL or freelist caches). This enables accurate
empty slab detection that accounts for TLS SLL cached blocks.
Changes:
- superslab_types.h: Add _Atomic uint16_t active field
- ss_allocation_box.c, hakmem_tiny_superslab.c: Initialize active=0
- tiny_free_fast_v2.inc.h: Decrement active on TLS SLL push
- tiny_alloc_fast.inc.h: Add tiny_active_track_alloc() helper,
increment active on TLS SLL pop (all code paths)
- ss_hot_cold_box.h: ss_is_slab_empty() uses active when enabled
All tracking is ENV-gated: HAKMEM_TINY_ACTIVE_TRACK=1 to enable.
Default is off for zero performance impact.
Invariant: active = used - tls_cached (active <= used)
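Sketch of the ENV-gated counter updates (pop hands a block to the user, push
takes it back):

```c
#include <stdatomic.h>
#include <stdint.h>

extern int g_active_track;   /* cached HAKMEM_TINY_ACTIVE_TRACK (assumed) */

static inline void tiny_active_track_alloc(_Atomic uint16_t *active) {
    if (g_active_track)
        atomic_fetch_add_explicit(active, 1, memory_order_relaxed);
}

static inline void tiny_active_track_free(_Atomic uint16_t *active) {
    if (g_active_track)
        atomic_fetch_sub_explicit(active, 1, memory_order_relaxed);
}
```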
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Gate the shared pool acquire debug variable behind #if !HAKMEM_BUILD_RELEASE:
- HAKMEM_SS_ACQUIRE_DEBUG: Controls shared pool acquisition stage tracing
- File: core/hakmem_shared_pool.c:780-788
The debug output itself was already gated inside #if !HAKMEM_BUILD_RELEASE
blocks; this change gates the ENV check as well. In release builds,
dbg_acquire is set to a constant 0, allowing the compiler to optimize the
checks away.
Performance: 31.1M ops/s (+2% vs baseline)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Gate the HeapV2 push debug logging behind #if !HAKMEM_BUILD_RELEASE:
- HAKMEM_TINY_HEAP_V2_DEBUG: Controls magazine push event tracing
- File: core/front/tiny_heap_v2.h:117-130
Wraps the ENV check and debug output that logs the first 5 push
operations per size class for HeapV2 magazine diagnostics.
Performance: 29.6M ops/s (within baseline range)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Gate the fast cache debug system behind #if !HAKMEM_BUILD_RELEASE:
- HAKMEM_TINY_FAST_DEBUG: Enable/disable fastcache event logging
- HAKMEM_TINY_FAST_DEBUG_MAX: Limit number of debug messages per class
- File: core/hakmem_tiny_fastcache.inc.h:48-76
Both variables are combined in a single gate since they work together as a
debug logging subsystem. In release builds, a no-op inline stub is provided.
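Shape of the stub arrangement (the logger name is hypothetical):

```c
#if !HAKMEM_BUILD_RELEASE
void fastcache_debug_log(int cls, const char *event);   /* real logger */
#else
static inline void fastcache_debug_log(int cls, const char *event) {
    (void)cls; (void)event;   /* no-op: no getenv, no I/O in release */
}
#endif
```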
Performance: 30.5M ops/s (baseline maintained)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Wrap debug functionality in !HAKMEM_BUILD_RELEASE guard with no-op stubs
for release builds. This eliminates getenv() calls for HAKMEM_TINY_ALLOC_DEBUG
in production while maintaining API compatibility.
Performance: 30.0M ops/s (baseline: 30.2M)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Problem:
- Larson benchmark showed 730K ops/s instead of expected 26M ops/s
- Class 1 TLS SLL cache always stayed empty (tls_count=0)
- All allocations went through slow path (shared_pool_acquire_slab at 48% CPU)
Root cause:
- In sll_refill_small_from_ss(), when TLS was completely uninitialized
(ss=NULL, meta=NULL, slab_base=NULL), the function returned 0 immediately
without calling superslab_refill() to initialize it
- The comment said "expect upper logic to call superslab_refill" but
tiny_alloc_fast_refill() did NOT call it after receiving 0
- This created a loop: TLS SLL stays empty → refill returns 0 → slow path
Fix:
- Remove the tls_uninitialized early return
- Let the existing downstream condition (!tls->ss || !tls->meta || ...)
handle the uninitialized case and call superslab_refill()
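Before/after shape of the change (field names from the text above; the return
convention of superslab_refill() is assumed):

```c
/* Before: a fresh TLS bailed out here, and the caller never refilled.
 *     if (!tls->ss && !tls->meta && !tls->slab_base) return 0;
 * After: removed; the existing downstream check covers it and refills. */
if (!tls->ss || !tls->meta || !tls->slab_base) {
    if (!superslab_refill(tls))   /* assumed 0-on-failure convention */
        return 0;
}
```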
Result:
- Throughput: 730K → 26.5M ops/s (36x improvement)
- shared_pool_acquire_slab: 48% → 0% in perf profile
Introduced in: fcf098857 (Phase12 debug, 2025-11-14)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
When superslab_refill() fails in the inner loop, tls->ss can remain
NULL even when produced > 0 (from earlier successful allocations).
This caused a segfault at high iteration counts (>500K) in the
random_mixed benchmark.
Root cause: Line 353 calls ss_active_add(tls->ss, ...) without
checking if tls->ss is NULL after a failed refill breaks the loop.
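A minimal sketch of the missing check around that call site:

```c
/* tls->ss may be NULL when a refill failed mid-loop, even though
 * earlier iterations produced blocks. */
if (produced > 0 && tls->ss != NULL)
    ss_active_add(tls->ss, produced);
```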
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Safety fix: ss_fast_lookup masks the pointer to a 1MB boundary and reads
memory at that address. If called with arbitrary (non-Tiny) pointers,
the masked address could be unmapped → SEGFAULT.
Changes:
- tiny_free_fast(): Reverted to safe hak_super_lookup (can receive
arbitrary pointers without prior validation)
- ss_fast_lookup(): Added safety warning in comments documenting when
it's safe to use (after header magic 0xA0 validation)
ss_fast_lookup remains in LARSON_FIX paths where header magic is
already validated before the SuperSlab lookup.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Replaces expensive hak_super_lookup() (registry hash lookup, 50-100 cycles)
with fast mask-based lookup (~5-10 cycles) in free hot paths.
Algorithm:
1. Mask pointer with SUPERSLAB_SIZE_MIN (1MB) - works for both 1MB and 2MB SS
2. Validate magic (SUPERSLAB_MAGIC)
3. Range check using ss->lg_size
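The three steps as code (sketch; the SuperSlab stand-in and magic are
placeholders for the real definitions, and how the 1MB mask resolves interior
addresses of a 2MB SS is not spelled out in this log):

```c
#include <stdint.h>

#define SUPERSLAB_SIZE_MIN (1UL << 20)   /* 1MB, per step 1 */

typedef struct {        /* minimal stand-in; real layout differs */
    uint64_t magic;     /* SUPERSLAB_MAGIC */
    uint8_t  lg_size;   /* log2(superslab size): 20 or 21 */
} SuperSlab;

extern const uint64_t SUPERSLAB_MAGIC;

static inline SuperSlab *ss_fast_lookup(void *p) {
    SuperSlab *ss = (SuperSlab *)((uintptr_t)p & ~(SUPERSLAB_SIZE_MIN - 1)); /* 1 */
    if (ss->magic != SUPERSLAB_MAGIC) return NULL;                           /* 2 */
    if ((uintptr_t)p - (uintptr_t)ss >= ((uintptr_t)1 << ss->lg_size))
        return NULL;                                                         /* 3 */
    return ss;
}
```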
Applied to:
- tiny_free_fast.inc.h: tiny_free_fast() SuperSlab path
- tiny_free_fast_v2.inc.h: LARSON_FIX cross-thread check
- front/malloc_tiny_fast.h: free_tiny_fast() LARSON_FIX path
Note: Performance impact minimal with LARSON_FIX=OFF (default) since
SuperSlab lookup is skipped entirely in that case. Optimization benefits
LARSON_FIX=ON path for safe multi-threaded operation.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
The allocation logging at lines 236-249 was missing the
#if !HAKMEM_BUILD_RELEASE guard, causing fprintf(stderr)
on every allocation even in release builds.
Impact: 19.8M ops/s → 28.0M ops/s (+42%)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
**Diagnostic Enhancement**: Complete malloc/free/pop operation tracing for debugging
**Problem**: Larson crashes with TLS_SLL_DUP at count=18; we need to trace the
exact pointer lifecycle to determine whether the allocator returns duplicate
addresses or the benchmark has a double-free bug.
**Implementation** (ChatGPT + Claude + Task collaboration):
1. **Global Operation Counter** (core/hakmem_tiny_config_box.inc:9):
- Single atomic counter for all operations (malloc/free/pop)
- Chronological ordering across all paths
2. **Allocation Logging** (core/hakmem_tiny_config_box.inc:148-161):
- HAK_RET_ALLOC macro enhanced with operation logging
- Logs first 50 class=1 allocations with ptr/base/tls_count
3. **Free Logging** (core/tiny_free_fast_v2.inc.h:222-235):
- Added before tls_sll_push() call (line 221)
- Logs first 50 class=1 frees with ptr/base/tls_count_before
4. **Pop Logging** (core/box/tls_sll_box.h:587-597):
- Added in tls_sll_pop_impl() after successful pop
- Logs first 50 class=1 pops with base/tls_count_after
5. **Drain Debug Logging** (core/box/tls_sll_drain_box.h:143-151):
- Enhanced drain loop with detailed logging
- Tracks pop failures and drained block counts
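Sketch of the shared counter (item 1): one atomic gives malloc/free/pop logs a
single chronological sequence:

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t g_op_counter;

static inline uint64_t hak_op_next(void) {
    return atomic_fetch_add_explicit(&g_op_counter, 1, memory_order_relaxed);
}
```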
**Initial Findings**:
- First 19 operations: ALL frees, ZERO allocations, ZERO pops
- OP#0006: First free of 0x...430
- OP#0018: Duplicate free of 0x...430 → TLS_SLL_DUP detected
- Suggests either (a) allocations occur before logging starts, or (b) a double-free bug in Larson itself
**Debug-only**: All logging gated by !HAKMEM_BUILD_RELEASE (zero cost in release)
**Next Steps**:
- Expand logging window to 200 operations
- Log initialization phase allocations
- Cross-check with Larson benchmark source
**Status**: Ready for extended testing
**Problem**: Larson benchmark crashes with TLS_SLL_DUP (double-free), 100% crash rate in debug
**Root Cause**: The TLS drain pushback code (commit c2f104618) created duplicates
by pushing pointers back onto the TLS SLL while they were still in the linked-list chain.
**Diagnostic Enhancements** (ChatGPT + Claude collaboration):
1. **Callsite Tracking**: Track file:line for each TLS SLL push (debug only)
- Arrays: g_tls_sll_push_file[], g_tls_sll_push_line[]
- Macro: tls_sll_push() auto-records __FILE__, __LINE__
2. **Enhanced Duplicate Detection**:
- Scan depth: 64 → 256 nodes (deep duplicate detection)
- Error message shows BOTH current and previous push locations
- Calls ptr_trace_dump_now() for detailed analysis
3. **Evidence Captured**:
- Both duplicate pushes from same line (221)
- Pointer at position 11 in TLS SLL (count=18, scanned=11)
- Confirms pointer allocated without being popped from TLS SLL
**Fix**:
- **core/box/tls_sll_drain_box.h**: Remove pushback code entirely
- Old: Push back to TLS SLL on validation failure → duplicates!
- New: Skip pointer (accept rare leak) to avoid duplicates
- Rationale: SuperSlab lookup failures are transient/rare
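Sketch of the corrected drain loop (remote_free() is a hypothetical hand-off;
the key change is skipping instead of re-pushing):

```c
void *p;
while ((p = tls_sll_pop(cls)) != NULL) {
    SuperSlab *ss = hak_super_lookup(p);
    if (!ss) continue;      /* transient failure: accept the rare leak */
    remote_free(ss, p);     /* never push p back onto the TLS SLL */
}
```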
**Status**: Fix implemented, ready for testing
**Updated**:
- LARSON_DOUBLE_FREE_INVESTIGATION.md: Root cause confirmed