Gate the fast cache debug system behind #if !HAKMEM_BUILD_RELEASE:
- HAKMEM_TINY_FAST_DEBUG: Enable/disable fastcache event logging
- HAKMEM_TINY_FAST_DEBUG_MAX: Limit number of debug messages per class
- File: core/hakmem_tiny_fastcache.inc.h:48-76
Both variables are combined under a single gate because they work together as one debug-logging subsystem. In release builds, the gate provides a no-op inline stub.
Performance: 30.5M ops/s (baseline maintained)
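The gate described above can be sketched as follows; the counter name, message budget, and log format are illustrative stand-ins, not the actual fastcache code:

```c
#include <stdio.h>

/* Default to a debug build if the flag is not defined (assumption). */
#ifndef HAKMEM_BUILD_RELEASE
#define HAKMEM_BUILD_RELEASE 0
#endif

#if !HAKMEM_BUILD_RELEASE
static int g_fast_debug_count = 0;        /* messages emitted so far  */
#define HAKMEM_TINY_FAST_DEBUG_MAX 50     /* per-class message budget */

static inline void fastcache_debug_log(int cls, const char *event) {
    if (g_fast_debug_count < HAKMEM_TINY_FAST_DEBUG_MAX) {
        fprintf(stderr, "[fastcache] class=%d %s\n", cls, event);
        g_fast_debug_count++;
    }
}
#else
/* Release builds: no-op inline stub, keeps the API, compiles away. */
static inline void fastcache_debug_log(int cls, const char *event) {
    (void)cls; (void)event;
}
#endif
```

In release builds the stub has an empty body, so the optimizer drops every call site entirely.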
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Wrap the debug functionality in a !HAKMEM_BUILD_RELEASE guard with no-op stubs for release builds. This eliminates getenv() calls for HAKMEM_TINY_ALLOC_DEBUG in production while preserving API compatibility.
Performance: 30.0M ops/s (baseline: 30.2M)
Problem:
- Larson benchmark showed 730K ops/s instead of expected 26M ops/s
- Class 1 TLS SLL cache always stayed empty (tls_count=0)
- All allocations went through slow path (shared_pool_acquire_slab at 48% CPU)
Root cause:
- In sll_refill_small_from_ss(), when TLS was completely uninitialized
(ss=NULL, meta=NULL, slab_base=NULL), the function returned 0 immediately
without calling superslab_refill() to initialize it
- The comment said "expect upper logic to call superslab_refill" but
tiny_alloc_fast_refill() did NOT call it after receiving 0
- This created a loop: TLS SLL stays empty → refill returns 0 → slow path
Fix:
- Remove the tls_uninitialized early return
- Let the existing downstream condition (!tls->ss || !tls->meta || ...)
handle the uninitialized case and call superslab_refill()
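A minimal model of the fix, with stand-in types and a fake superslab_refill(); only the control flow mirrors the description above, nothing here is the real HAKMEM code:

```c
#include <stddef.h>

/* Illustrative TLS layout, not the real structure. */
typedef struct { void *ss; void *meta; void *slab_base; int count; } TinyTLS;

/* Stand-in: pretend a refill attaches a superslab and yields 8 blocks. */
static int superslab_refill(TinyTLS *tls) {
    static int backing[2];
    tls->ss = tls->meta = tls->slab_base = backing;
    tls->count = 8;
    return 1;
}

/* Fixed version: no early return on uninitialized TLS. The same
   downstream check now handles both "never initialized" and
   "exhausted", so superslab_refill() is always reached. */
static int sll_refill_small_from_ss(TinyTLS *tls) {
    if (!tls->ss || !tls->meta || !tls->slab_base) {
        if (!superslab_refill(tls))
            return 0;               /* genuinely out of memory */
    }
    return tls->count;              /* blocks now available */
}
```

The buggy version returned 0 before ever calling superslab_refill() when all three pointers were NULL, which is exactly the empty-cache loop described above.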
Result:
- Throughput: 730K → 26.5M ops/s (36x improvement)
- shared_pool_acquire_slab: 48% → 0% in perf profile
Introduced in: fcf098857 (Phase12 debug, 2025-11-14)
Output now shows: Throughput = XXX ops/s [iter=N ws=M] time=Xs
This prevents confusion when comparing results measured with different
workset sizes (e.g., ws=256 gives 67M ops/s vs ws=8192 gives 18M ops/s).
When superslab_refill() fails in the inner loop, tls->ss can remain
NULL even when produced > 0 (from earlier successful allocations).
This caused a segfault at high iteration counts (>500K) in the
random_mixed benchmark.
Root cause: Line 353 calls ss_active_add(tls->ss, ...) without checking whether tls->ss is NULL after a failed refill breaks out of the loop.
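A hedged sketch of the guard the fix adds; the structure and ss_active_add() are stand-ins for the real code:

```c
#include <stddef.h>

typedef struct { void *ss; } TinyTLS;   /* illustrative TLS layout */

static int g_active_adds = 0;
static void ss_active_add(void *ss, int n) { (void)ss; (void)n; g_active_adds++; }

static void finish_refill(TinyTLS *tls, int produced) {
    /* After the inner loop, a failed superslab_refill() can leave
       tls->ss NULL even though produced > 0 from earlier successful
       iterations. The NULL check is the fix. */
    if (produced > 0 && tls->ss != NULL)
        ss_active_add(tls->ss, produced);
}
```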
Safety fix: ss_fast_lookup masks the pointer down to a 1MB boundary and reads memory at that address. If it is called with an arbitrary (non-Tiny) pointer, the masked address can be unmapped → SEGFAULT.
Changes:
- tiny_free_fast(): Reverted to safe hak_super_lookup (can receive
arbitrary pointers without prior validation)
- ss_fast_lookup(): Added safety warning in comments documenting when
it's safe to use (after header magic 0xA0 validation)
ss_fast_lookup remains in LARSON_FIX paths where header magic is
already validated before the SuperSlab lookup.
Replaces the expensive hak_super_lookup() (registry hash lookup, 50-100 cycles) with a fast mask-based lookup (~5-10 cycles) in the free hot paths.
Algorithm:
1. Mask pointer with SUPERSLAB_SIZE_MIN (1MB) - works for both 1MB and 2MB SS
2. Validate magic (SUPERSLAB_MAGIC)
3. Range check using ss->lg_size
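The three steps can be sketched like this; the magic value, struct layout, and field names are assumptions (the commit states the 1MB mask also covers 2MB superslabs; this sketch models only the 1MB case):

```c
#include <stdint.h>
#include <stdlib.h>

#define SUPERSLAB_SIZE_MIN ((uintptr_t)1 << 20)   /* 1MB */
#define SUPERSLAB_MAGIC    0x53534C42u            /* stand-in magic value */

/* Illustrative header layout, not the real SuperSlab struct. */
typedef struct { uint32_t magic; uint8_t lg_size; } SuperSlab;

static SuperSlab *ss_fast_lookup(void *p) {
    /* 1. Mask the pointer down to the 1MB boundary. */
    SuperSlab *ss = (SuperSlab *)((uintptr_t)p & ~(SUPERSLAB_SIZE_MIN - 1));
    /* 2. Validate the magic (this read is what segfaults for
          arbitrary pointers, hence the safety warning above). */
    if (ss->magic != SUPERSLAB_MAGIC) return 0;
    /* 3. Range check using the recorded size. */
    if ((uintptr_t)p - (uintptr_t)ss >= ((uintptr_t)1 << ss->lg_size)) return 0;
    return ss;
}
```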
Applied to:
- tiny_free_fast.inc.h: tiny_free_fast() SuperSlab path
- tiny_free_fast_v2.inc.h: LARSON_FIX cross-thread check
- front/malloc_tiny_fast.h: free_tiny_fast() LARSON_FIX path
Note: Performance impact minimal with LARSON_FIX=OFF (default) since
SuperSlab lookup is skipped entirely in that case. Optimization benefits
LARSON_FIX=ON path for safe multi-threaded operation.
The allocation logging at lines 236-249 was missing the
#if !HAKMEM_BUILD_RELEASE guard, causing an fprintf(stderr)
on every allocation even in release builds.
Impact: 19.8M ops/s → 28.0M ops/s (+42%)
**Diagnostic Enhancement**: Complete malloc/free/pop operation tracing for debug
**Problem**: Larson crashes with TLS_SLL_DUP at count=18; we need to trace the exact pointer lifecycle to determine whether the allocator returns duplicate addresses or the benchmark itself has a double-free bug.
**Implementation** (ChatGPT + Claude + Task collaboration):
1. **Global Operation Counter** (core/hakmem_tiny_config_box.inc:9):
- Single atomic counter for all operations (malloc/free/pop)
- Chronological ordering across all paths
2. **Allocation Logging** (core/hakmem_tiny_config_box.inc:148-161):
- HAK_RET_ALLOC macro enhanced with operation logging
- Logs first 50 class=1 allocations with ptr/base/tls_count
3. **Free Logging** (core/tiny_free_fast_v2.inc.h:222-235):
- Added before tls_sll_push() call (line 221)
- Logs first 50 class=1 frees with ptr/base/tls_count_before
4. **Pop Logging** (core/box/tls_sll_box.h:587-597):
- Added in tls_sll_pop_impl() after successful pop
- Logs first 50 class=1 pops with base/tls_count_after
5. **Drain Debug Logging** (core/box/tls_sll_drain_box.h:143-151):
- Enhanced drain loop with detailed logging
- Tracks pop failures and drained block counts
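Item 1, the global atomic counter, might look roughly like this; the log format and the 50-message window are modeled on the description above, the rest is illustrative:

```c
#include <stdatomic.h>
#include <stdio.h>

/* Single atomic counter shared by malloc/free/pop paths, giving a
   chronological ordering across all operations. */
static _Atomic unsigned long g_op_counter = 0;

static unsigned long op_log(const char *kind, int cls, void *ptr) {
    unsigned long op = atomic_fetch_add(&g_op_counter, 1);
    if (cls == 1 && op < 50)   /* log only the first class-1 window */
        fprintf(stderr, "OP#%04lu %s class=%d ptr=%p\n", op, kind, cls, ptr);
    return op;
}
```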
**Initial Findings**:
- First 19 operations: ALL frees, ZERO allocations, ZERO pops
- OP#0006: First free of 0x...430
- OP#0018: Duplicate free of 0x...430 → TLS_SLL_DUP detected
- Suggests either: (a) allocations before logging starts, or (b) Larson bug
**Debug-only**: All logging gated by !HAKMEM_BUILD_RELEASE (zero cost in release)
**Next Steps**:
- Expand logging window to 200 operations
- Log initialization phase allocations
- Cross-check with Larson benchmark source
**Status**: Ready for extended testing
**Problem**: Larson benchmark crashes with TLS_SLL_DUP (double-free), 100% crash rate in debug
**Root Cause**: TLS drain pushback code (commit c2f104618) created duplicates by
pushing pointers back to TLS SLL while they were still in the linked list chain.
**Diagnostic Enhancements** (ChatGPT + Claude collaboration):
1. **Callsite Tracking**: Track file:line for each TLS SLL push (debug only)
- Arrays: g_tls_sll_push_file[], g_tls_sll_push_line[]
- Macro: tls_sll_push() auto-records __FILE__, __LINE__
2. **Enhanced Duplicate Detection**:
- Scan depth: 64 → 256 nodes (deep duplicate detection)
- Error message shows BOTH current and previous push locations
- Calls ptr_trace_dump_now() for detailed analysis
3. **Evidence Captured**:
- Both duplicate pushes from same line (221)
- Pointer at position 11 in TLS SLL (count=18, scanned=11)
- Confirms pointer allocated without being popped from TLS SLL
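The callsite tracking from item 1 can be sketched as a macro that records __FILE__/__LINE__ at each push; the array size and the push body are stand-ins for the real implementation:

```c
/* Illustrative callsite-tracking wrapper (debug builds only). */
#define TLS_SLL_TRACK_MAX 256
static const char *g_tls_sll_push_file[TLS_SLL_TRACK_MAX];
static int         g_tls_sll_push_line[TLS_SLL_TRACK_MAX];
static int         g_tls_sll_push_idx;

static void tls_sll_push_impl(void *p, const char *file, int line) {
    int i = g_tls_sll_push_idx++ % TLS_SLL_TRACK_MAX;
    g_tls_sll_push_file[i] = file;   /* remember who pushed this slot */
    g_tls_sll_push_line[i] = line;
    (void)p;                         /* real list push omitted here */
}

/* Every callsite auto-records its location; on a duplicate, the error
   path can print both the current and the previous push location. */
#define tls_sll_push(p) tls_sll_push_impl((p), __FILE__, __LINE__)
```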
**Fix**:
- **core/box/tls_sll_drain_box.h**: Remove pushback code entirely
- Old: Push back to TLS SLL on validation failure → duplicates!
- New: Skip pointer (accept rare leak) to avoid duplicates
- Rationale: SuperSlab lookup failures are transient/rare
**Status**: Fix implemented, ready for testing
**Updated**:
- LARSON_DOUBLE_FREE_INVESTIGATION.md: Root cause confirmed
Added weak stubs to core/link_stubs.c for symbols that are not needed
in HAKMEM_FORCE_LIBC_ALLOC_BUILD=1 (TSan/ASan) builds:
Stubs added:
- g_bump_chunk (int)
- g_tls_bcur, g_tls_bend (__thread uint8_t*[8])
- smallmid_backend_free()
- expand_superslab_head()
Also added: #include <stdint.h> for uint8_t
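A weak-stub sketch in the spirit of link_stubs.c; the exact signatures of the two functions are assumptions, and a strong definition elsewhere overrides each stub at link time:

```c
#include <stdint.h>

/* Weak data stubs: satisfy the linker when the real definitions are
   compiled out (HAKMEM_FORCE_LIBC_ALLOC_BUILD=1 TSan/ASan builds). */
__attribute__((weak)) int g_bump_chunk = 0;
__attribute__((weak)) __thread uint8_t *g_tls_bcur[8];
__attribute__((weak)) __thread uint8_t *g_tls_bend[8];

/* Weak function stubs: no-ops, replaced by any strong definition. */
__attribute__((weak)) void smallmid_backend_free(void *p) { (void)p; }
__attribute__((weak)) void expand_superslab_head(void) { /* no-op */ }
```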
Impact:
- TSan build: PASS (larson_hakmem_tsan successfully built)
- Phase 2 ready: Can now use TSan to debug Larson crashes
Next: Use TSan to investigate Larson 47% crash rate
Optimized HAKMEM_SFC_DEBUG environment variable handling by caching
the value at initialization instead of repeated getenv() calls in
hot paths.
Changes:
1. Added g_sfc_debug global variable (core/hakmem_tiny_sfc.c)
- Initialized once in sfc_init() by reading HAKMEM_SFC_DEBUG
- Single source of truth for SFC debug state
2. Declared g_sfc_debug as extern (core/hakmem_tiny_config.h)
- Available to all modules that need SFC debug checks
3. Replaced getenv() with g_sfc_debug in hot paths:
- core/tiny_alloc_fast_sfc.inc.h (allocation path)
- core/tiny_free_fast.inc.h (free path)
- core/box/hak_wrappers.inc.h (wrapper layer)
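The caching pattern boils down to one init-time getenv() and a plain global load afterwards; g_sfc_debug and sfc_init() match the names above, but the bodies are illustrative:

```c
#include <stdlib.h>

/* Read once at init; hot paths then do a plain load instead of
   calling getenv() on every allocation or free. */
int g_sfc_debug = 0;

void sfc_init(void) {
    const char *e = getenv("HAKMEM_SFC_DEBUG");
    g_sfc_debug = (e && *e && *e != '0') ? 1 : 0;
}
```

Hot-path code then checks `if (g_sfc_debug) ...`, which is a single memory load instead of a libc call.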
Impact:
- getenv() calls: 7 → 1 (86% reduction)
- Hot-path calls eliminated: 6 (all moved to init-time)
- Performance: 15.10M ops/s (stable, 0% CV)
- Build: Clean compilation, no new warnings
Testing:
- 10 runs of 100K iterations: consistent performance
- Symbol verification: g_sfc_debug present in hakmem_tiny_sfc.o
- No regression detected
Note: 3 additional getenv("HAKMEM_SFC_DEBUG") calls exist in
hakmem_tiny_ultra_simple.inc but are dead code (file not compiled
in current build configuration).
Files modified: 5 core files
Status: Production-ready, all tests passed
Created comprehensive record of all optimization commits that led to
80M ops/s Random Mixed performance on master branch:
- UNIFIED-HEADER (472b6a60b): +17% improvement
- Bug fixes (d26519f67): +15-41% improvement
- tiny_get_max_size inline (e81fe783d): +2M ops/s
- Min-Keep policy (04a60c316): +1M ops/s
- Unified Cache tuning (392d29018): +1M ops/s
- Stage 1 lock-free (dcd89ee88): +0.3M ops/s
- Shared Pool optimizations: +2-3M ops/s
Also documents:
- Step 2.5 bug (19c1abfe7) causing 100% Larson crash
- Architecture differences (E1-CORRECT vs UNIFIED-HEADER)
- Current state comparison between master and larson-master-rebuild
- Three future options with trade-offs
This documentation preserves knowledge of the 80M ops/s optimization path
for future reference once larson-master-rebuild becomes the new master.
- Policy: Set tiny_min_keep for C2-C6 to reduce mmap/munmap churn
- Policy: Loosen tiny_cap (soft cap) for C4-C6 to allow more active slots
- Added tiny_min_keep field to FrozenPolicy struct
Larson: 52.13M ops/s (stable)
- Move tiny_get_max_size to header for inlining
- Use cached static variable to avoid repeated env lookup
- Larson: 51.99M ops/s (stable)