Root cause identified by Task agent investigation:
- superslab_allocate() called without declaration in 2 files
- Compiler assumes implicit int return type (C99 standard)
- Actual signature returns SuperSlab* (64-bit pointer)
- Pointer truncated to 32-bit int, then sign-extended to 64-bit
- Results in corrupted pointer and segmentation fault
Mechanism of corruption:
1. superslab_allocate() returns 0x00005555eba00000
2. Compiler expects int, reads only %eax: 0xeba00000
3. movslq %eax,%rbp sign-extends with bit 31 set
4. Result: 0xffffffffeba00000 (invalid pointer)
5. Dereferencing causes SEGFAULT
Files fixed:
1. hakmem_tiny_superslab_internal.h - Added box/ss_allocation_box.h
(fixes superslab_head.c via transitive include)
2. hakmem_super_registry.c - Added box/ss_allocation_box.h
Warnings eliminated:
- "implicit declaration of function 'superslab_allocate'"
- "type of 'superslab_allocate' does not match original declaration"
- "code may be misoptimized unless '-fno-strict-aliasing' is used"
Test results:
- larson_hakmem now runs without segfault ✓
- Multiple test runs confirmed stable ✓
- 2 threads, 4 threads: All passing ✓
Impact:
- CRITICAL severity bug (affects all SuperSlab expansion)
- Intermittent (depends on memory layout ~50% probability)
- Now FIXED completely
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Second freelist path identified by Task exploration agent:
- tiny_drain_freelist_to_sll_once() in hakmem_tiny_free.inc
- Activated via HAKMEM_TINY_DRAIN_TO_SLL environment variable
- Pops blocks from freelist without restoring headers
- Missing header restoration before tls_sll_push() call
Fix applied:
1. Added HEADER_MAGIC restoration before tls_sll_push()
in tiny_drain_freelist_to_sll_once() (lines 74-79)
2. Added tiny_region_id.h include for HEADER_MAGIC definition
This completes the header restoration fixes for all known
freelist → TLS SLL code paths:
1. box_carve_and_push_with_freelist() ✓ (commit 3c6c76cb1)
2. tiny_drain_freelist_to_sll_once() ✓ (this commit)
Expected result:
- Eliminates remaining 4-thread header corruption error
- All freelist blocks now have valid headers before TLS SLL push
Note: Encountered segfault in larson_hakmem during testing,
but this appears to be a pre-existing issue unrelated to
header restoration fixes (verified by testing without changes).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
1. Archive unused backend files (ss_legacy/unified_backend_box.c/h)
- These files were not linked in the build
- Moved to archive/ to reduce confusion
2. Created HAK_RET_ALLOC_BLOCK macro for SuperSlab allocations
- Replaces superslab_return_block() function
- Consistent with existing HAK_RET_ALLOC pattern
- Single source of truth for header writing
- Defined in hakmem_tiny_superslab_internal.h
3. Added header validation on TLS SLL push
- Detects blocks pushed without proper header
- Enabled via HAKMEM_TINY_SLL_VALIDATE_HDR=1 (release)
- Always on in debug builds
- Logs first 10 violations with backtraces
Benefits:
- Easier to track allocation paths
- Catches header bugs at push time
- More maintainable macro-based design
Note: Larson bug still reproduces - header corruption occurs
before push validation can catch it.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Major refactoring to improve maintainability and debugging:
1. Split hakmem_tiny_superslab.c (1521 lines) into 7 focused files:
- superslab_allocate.c: SuperSlab allocation/deallocation
- superslab_backend.c: Backend allocation paths (legacy, shared)
- superslab_ace.c: ACE (Adaptive Cache Engine) logic
- superslab_slab.c: Slab initialization and bitmap management
- superslab_cache.c: LRU cache and prewarm cache management
- superslab_head.c: SuperSlabHead management and expansion
- superslab_stats.c: Statistics tracking and debugging
2. Created hakmem_tiny_superslab_internal.h for shared declarations
3. Added superslab_return_block() as single exit point for header writing:
- All backend allocations now go through this helper
- Prevents bugs where headers are forgotten in some paths
- Makes future debugging easier
4. Updated Makefile for new file structure
5. Added header writing to ss_legacy_backend_box.c and
ss_unified_backend_box.c (though not currently linked)
Note: Header corruption bug in Larson benchmark still exists.
Class 1-6 allocations go through TLS refill/carve paths, not backend.
Further investigation needed.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Changes:
- smallmid_is_in_range(): Add __attribute__((always_inline))
- mid_is_in_range(): Add __attribute__((always_inline))
Expected: Reduce function call overhead in Front Gate routing
Result: Neutral performance (~72M ops/s, same as Phase 1 final)
Analysis:
- Compiler was already inlining these simple functions with -O3 -flto
- 36M branches identified by perf are NOT from Front Gate routing
- Most branches are inside allocators (tiny_alloc, free, etc.)
- Front Gate optimization had minimal impact, as predicted
Next: SuperSlab size optimization (clear 3-5% benefit expected)
Files:
- core/hakmem_smallmid.h:116-119
- core/hakmem_mid_mt.h:228-231
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Problem: Large amount of debug logs in release builds causing performance
degradation in benchmarks (ChatGPT reported 0.73M ops/s vs expected 70M+).
Solution: Guard Ultra SLIM gate debug log with #if !HAKMEM_BUILD_RELEASE.
This log was printing once per thread, acceptable in debug but should be
silent in production.
Performance impact: Logs now suppressed in release builds, reducing I/O
overhead during benchmarks.
Refactored the extremely compressed line 312 (previously 600+ chars) into
properly indented, readable code while preserving identical logic:
- Broke down TLS local freelist spill operation into clear steps
- Added clarifying comment for spill operation
- Improved atomic CAS loop formatting
- No functional changes, only formatting improvements
Performance verified: 16-18M ops/s maintained (same as before)
Replace hardcoded values with named constants for better maintainability:
- ELO_MAX_CPU_NS = 100000.0 (100 microseconds)
- ELO_MAX_PAGE_FAULTS = 1000.0
- ELO_MAX_BYTES_LIVE = 100000000.0 (100 MB)
These constants define the normalization range for ELO score computation.
Moving them to file scope makes them easier to tune and document.
Performance: No change (70.1M ops/s average)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Remove obsolete comment line referencing deleted Phase E5 code.
The actual code was already removed in 2025-11-27 cleanup.
Performance: No change (69.7M ops/s)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add unified stats/dump control that allows enabling specific stats
modules using comma-separated values or "all" to enable everything.
New file: core/hakmem_stats_master.h
- HAKMEM_STATS=all: Enable all stats modules
- HAKMEM_STATS=sfc,fast,pool: Enable specific modules
- HAKMEM_STATS_DUMP=1: Dump stats at exit
- hak_stats_check(): Check if module should enable stats
Available stats modules:
sfc, fast, heap, refill, counters, ring, invariant,
pagefault, front, pool, slim, guard, nearempty
Updated files:
- core/hakmem_tiny_sfc.c: Use hak_stats_check() for SFC stats
- core/hakmem_shared_pool.c: Use hak_stats_check() for pool stats
Performance: No regression (72.9M ops/s)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add unified trace control that allows enabling specific trace modules
using comma-separated values or "all" to enable everything.
New file: core/hakmem_trace_master.h
- HAKMEM_TRACE=all: Enable all trace modules
- HAKMEM_TRACE=ptr,refill,free,mailbox: Enable specific modules
- HAKMEM_TRACE_LEVEL=N: Set trace verbosity (1-3)
- hak_trace_check(): Check if module should enable tracing
Available trace modules:
ptr, refill, superslab, ring, free, mailbox, registry
Priority order:
1. HAKMEM_QUIET=1 → suppress all
2. Specific module ENV (e.g., HAKMEM_PTR_TRACE=1)
3. HAKMEM_TRACE=module1,module2
4. Default → disabled
Updated files:
- core/tiny_refill.h: Use hak_trace_check() for refill tracing
- core/box/mailbox_box.c: Use hak_trace_check() for mailbox tracing
Performance: No regression (72.9M ops/s)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add centralized debug control system that allows enabling all debug
modules at once, while maintaining backwards compatibility with
individual module ENVs.
New file: core/hakmem_debug_master.h
- HAKMEM_DEBUG_ALL=1: Enable all debug modules
- HAKMEM_DEBUG_LEVEL=N: Set debug level (0=off, 1=critical, 2=normal, 3=verbose)
- HAKMEM_QUIET=1: Suppress all debug (highest priority)
- hak_debug_check(): Check if module should enable debug
- hak_is_quiet(): Quick check for quiet mode
Priority order:
1. HAKMEM_QUIET=1 → suppress all
2. Specific module ENV (e.g., HAKMEM_SFC_DEBUG=1)
3. HAKMEM_DEBUG_ALL=1
4. HAKMEM_DEBUG_LEVEL >= threshold
Updated files:
- core/hakmem_elo.c: Use hak_is_quiet() instead of local implementation
- core/hakmem_shared_pool.c: Use hak_debug_check() for lock stats
Performance: No regression (71.5M ops/s maintained)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Critical hot path fix: hakmem_elo.c was calling getenv("HAKMEM_QUIET")
10+ times inside loops, causing 50-100μs overhead per iteration.
Fix: Cache the flag in a static variable with lazy initialization.
- Added is_quiet() helper function with __builtin_expect optimization
- Replaced all 10 inline getenv() calls with is_quiet()
- First call initializes, subsequent calls are just a branch
This is part of the ENV variable cleanup effort identified by the survey:
- Total ENV variables: 228 (target: ~80)
- getenv() calls in hot paths: CRITICAL issue
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Problem: Class 0 (8B stride) was using offset 1 for next pointer storage,
but 8B stride cannot fit [1B header][8B next pointer] - it overflows by 1 byte
into the adjacent block.
Fix: Use offset 0 for C0 (same as C7), allowing the header to be overwritten.
This is safe because:
1. class_map provides out-of-band class_idx lookup (header not needed for free)
2. P3 skips header write by default (header byte is unused anyway)
Optimization: Replace branching with bitmask lookup for zero-cost abstraction.
- Old: (class_idx == 0 || class_idx == 7) ? 0u : 1u (branch)
- New: (0x7Eu >> class_idx) & 1u (branchless)
Bit pattern: C0=0, C1-C6=1, C7=0 → 0b01111110 = 0x7E
Performance results:
- 8B: 85.19M → 85.61M (+0.5%)
- 16B: 137.43M → 147.31M (+7.2%)
- 64B: 84.21M → 84.90M (+0.8%)
Thanks to ChatGPT for spotting the g_tiny_class_sizes vs tiny_nextptr.h mismatch!
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Skip the 1-byte header write in tiny_region_id_write_header() when class_map
is active (default). class_map provides out-of-band class_idx lookup, making
the header byte unnecessary for the free path.
Changes:
- Add ENV-gated conditional to skip header write (default: skip)
- ENV: HAKMEM_TINY_WRITE_HEADER=1 to force header write (legacy mode)
- Memory layout preserved: user pointer = base + 1 (1B unused when skipped)
Performance improvement:
- tiny_hot 64B: 83.5M → 84.2M ops/sec (+0.8%)
- random_mixed ws=256: 68.1M → 72.2M ops/sec (+6%)
The header skip reduces one store instruction per allocation, which is
particularly beneficial for mixed-size workloads like random_mixed.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add active field to TinySlabMeta to track blocks currently held by
users (not in TLS SLL or freelist caches). This enables accurate
empty slab detection that accounts for TLS SLL cached blocks.
Changes:
- superslab_types.h: Add _Atomic uint16_t active field
- ss_allocation_box.c, hakmem_tiny_superslab.c: Initialize active=0
- tiny_free_fast_v2.inc.h: Decrement active on TLS SLL push
- tiny_alloc_fast.inc.h: Add tiny_active_track_alloc() helper,
increment active on TLS SLL pop (all code paths)
- ss_hot_cold_box.h: ss_is_slab_empty() uses active when enabled
All tracking is ENV-gated: HAKMEM_TINY_ACTIVE_TRACK=1 to enable.
Default is off for zero performance impact.
Invariant: active = used - tls_cached (active <= used)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Gate the shared pool acquire debug variable behind #if !HAKMEM_BUILD_RELEASE:
- HAKMEM_SS_ACQUIRE_DEBUG: Controls shared pool acquisition stage tracing
- File: core/hakmem_shared_pool.c:780-788
The debug output was already gated inside #if !HAKMEM_BUILD_RELEASE blocks.
This change only gates the ENV check itself. In release builds, sets
dbg_acquire to constant 0, allowing compiler to optimize away checks.
Performance: 31.1M ops/s (+2% vs baseline)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Gate the HeapV2 push debug logging behind #if !HAKMEM_BUILD_RELEASE:
- HAKMEM_TINY_HEAP_V2_DEBUG: Controls magazine push event tracing
- File: core/front/tiny_heap_v2.h:117-130
Wraps the ENV check and debug output that logs the first 5 push
operations per size class for HeapV2 magazine diagnostics.
Performance: 29.6M ops/s (within baseline range)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Gate the fast cache debug system behind #if !HAKMEM_BUILD_RELEASE:
- HAKMEM_TINY_FAST_DEBUG: Enable/disable fastcache event logging
- HAKMEM_TINY_FAST_DEBUG_MAX: Limit number of debug messages per class
- File: core/hakmem_tiny_fastcache.inc.h:48-76
Both variables combined in single gate since they work together as a
debug logging subsystem. In release builds, provides no-op inline stub.
Performance: 30.5M ops/s (baseline maintained)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Wrap debug functionality in !HAKMEM_BUILD_RELEASE guard with no-op stubs
for release builds. This eliminates getenv() calls for HAKMEM_TINY_ALLOC_DEBUG
in production while maintaining API compatibility.
Performance: 30.0M ops/s (baseline: 30.2M)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>