Implementation:
- 3-mode control via HAKMEM_TINY_SS_SHARED env var
- 0: Legacy only
- 1: Shared Pool + Legacy fallback
- 2: Shared Pool only (DEFAULT)
- Mode 2 returns NULL on failure (no Legacy fallback)
- 'Reversible box' design - can switch back via env var
Results:
- ✅ Legacy backend cleanly disabled
- ✅ No shared_fail→legacy in Mode 2
- ✅ Env var switching verified
Known Issues:
- TLS_SLL_DUP remains in Shared Pool backend (cls=5, 141 pointers)
- This is a Shared Pool backend internal issue, not Legacy backend
- Phase 9-3 will address root cause
Box Theory Compliance:
- Single Responsibility: Shared Pool only manages state
- Clear Contract: 3 modes clearly defined
- Observable: Debug logs show mode selection
- Composable: Instant env var switching
Performance:
- Some benchmarks may be slower (user approved)
- Stability prioritized over performance
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
**Problem**: Include order dependency prevented using TINY_FRONT_FASTCACHE_ENABLED
macro in hakmem_tiny_refill.inc.h (included before tiny_alloc_fast.inc.h).
**Solution** (from ChatGPT advice):
- Move wrapper functions to tiny_front_config_box.h as static inline
- This makes them available regardless of include order
- Enables dead code elimination in PGO mode for refill path
**Changes**:
1. core/box/tiny_front_config_box.h:
- Add tiny_fastcache_enabled() and sfc_cascade_enabled() as static inline
- These access static global variables via extern declaration
2. core/hakmem_tiny_refill.inc.h:
- Include tiny_front_config_box.h
- Use TINY_FRONT_FASTCACHE_ENABLED macro (line 162)
- Enables dead code elimination in PGO mode
3. core/tiny_alloc_fast.inc.h:
- Remove duplicate wrapper function definitions
- Now uses functions from config box header
**Performance**: 79.8M ops/s (maintained, 77M/81M/81M across 3 runs)
**Design Principle**: Config Box as "single entry point" for Tiny Front policy
- All config checks go through TINY_FRONT_*_ENABLED macros
- Wrapper functions centralized in config box header
- Include order independent (static inline in header)
🐱 Generated with ChatGPT advice for solving include order dependencies
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Explicitly guard SuperSlab validation with #if !HAKMEM_BUILD_RELEASE
to document that this code is debug-only.
Changes:
- core/tiny_region_id.h: Add #if !HAKMEM_BUILD_RELEASE guard around
hak_super_lookup() validation code (lines 199-239)
- Improves code readability: Makes debug-only intent explicit
- Self-documenting: No need to check Makefile to understand behavior
- Defensive: Works correctly even if LTO is disabled
Performance Impact:
- Measured: +1.67% (bench_random_mixed), +1.33% (bench_mid_mt_gap)
- Expected: +12-15% (based on initial perf interpretation)
- Actual: NO measurable improvement (within noise margin ±3.6%)
Root Cause (Investigation):
- Compiler (LTO) already eliminated hak_super_lookup() automatically
- The function never existed in compiled binary (verified via nm/objdump)
- Default Makefile has -DHAKMEM_BUILD_RELEASE=1 + -flto
- perf's "15.84% CPU" was misattributed (was free(), not hak_super_lookup)
Conclusion:
This change provides NO performance benefit, but IMPROVES code clarity
by making the debug-only nature explicit rather than relying on
implicit compiler optimization.
Files:
- core/tiny_region_id.h - Add explicit debug guard
- PHASE6A_DISCREPANCY_INVESTIGATION.md - Full investigation report
Lessons Learned:
1. Always verify assembly output before claiming optimizations
2. perf attribution can be misleading - cross-reference with symbols
3. LTO is extremely aggressive at dead code elimination
4. Small improvements (<2× stdev) need statistical validation
See PHASE6A_DISCREPANCY_INVESTIGATION.md for complete analysis.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Implement compile-time configuration system for dead code elimination in Tiny
allocation hot paths. The Config Box provides dual-mode configuration:
- Normal mode: Runtime ENV checks (backward compatible, flexible)
- PGO mode: Compile-time constants (dead code elimination, performance)
PERFORMANCE:
- Baseline (runtime config): 50.32 M ops/s (avg of 5 runs)
- Config Box (PGO mode): 52.77 M ops/s (avg of 5 runs)
- Improvement: +2.45 M ops/s (+4.87% with outlier, +2.72% without)
- Target: +5-8% (partially achieved)
IMPLEMENTATION:
1. core/box/tiny_front_config_box.h (NEW):
- Defines TINY_FRONT_*_ENABLED macros for all config checks
- PGO mode (#if HAKMEM_TINY_FRONT_PGO): Macros expand to constants (0/1)
- Normal mode (#else): Macros expand to function calls
- Functions remain in their original locations (no code duplication)
2. core/hakmem_build_flags.h:
- Added HAKMEM_TINY_FRONT_PGO build flag (default: 0, off)
- Documentation: Usage with make EXTRA_CFLAGS="-DHAKMEM_TINY_FRONT_PGO=1"
3. core/box/hak_wrappers.inc.h:
- Replaced front_gate_unified_enabled() with TINY_FRONT_UNIFIED_GATE_ENABLED
- 2 call sites updated (malloc and free fast paths)
- Added config box include
EXPECTED DEAD CODE ELIMINATION (PGO mode):
if (TINY_FRONT_UNIFIED_GATE_ENABLED) { ... }
→ if (1) { ... } // Constant, always true
→ Compiler optimizes away the branch, keeps body
SCOPE:
Currently only front_gate_unified_enabled() is replaced (2 call sites).
To achieve full +5-8% target, expand to other config checks:
- ultra_slim_mode_enabled()
- tiny_heap_v2_enabled()
- sfc_cascade_enabled()
- tiny_fastcache_enabled()
- tiny_metrics_enabled()
- tiny_diag_enabled()
BUILD USAGE:
Normal mode (runtime config, default):
make bench_random_mixed_hakmem
PGO mode (compile-time config, dead code elimination):
make EXTRA_CFLAGS="-DHAKMEM_TINY_FRONT_PGO=1" bench_random_mixed_hakmem
BOX PATTERN COMPLIANCE:
✅ Single Responsibility: Configuration management ONLY
✅ Clear Contract: Dual-mode (PGO = constants, Normal = runtime)
✅ Observable: Config report function (debug builds)
✅ Safe: Backward compatible (default is normal mode)
✅ Testable: Easy A/B comparison (PGO vs normal builds)
WHY +2.7-4.9% (below +5-8% target)?
- Limited scope: Only 2 call sites for 1 config function replaced
- Lazy init overhead: front_gate_unified_enabled() cached after first call
- Need to expand to more config checks for full benefit
NEXT STEPS:
- Expand config macro usage to other functions (optional)
- OR proceed with PGO re-enablement (Final polish)
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
Root cause identified by Task agent investigation:
- superslab_allocate() called without declaration in 2 files
- Compiler assumes implicit int return type (C99 standard)
- Actual signature returns SuperSlab* (64-bit pointer)
- Pointer truncated to 32-bit int, then sign-extended to 64-bit
- Results in corrupted pointer and segmentation fault
Mechanism of corruption:
1. superslab_allocate() returns 0x00005555eba00000
2. Compiler expects int, reads only %eax: 0xeba00000
3. movslq %eax,%rbp sign-extends with bit 31 set
4. Result: 0xffffffffeba00000 (invalid pointer)
5. Dereferencing causes SEGFAULT
Files fixed:
1. hakmem_tiny_superslab_internal.h - Added box/ss_allocation_box.h
(fixes superslab_head.c via transitive include)
2. hakmem_super_registry.c - Added box/ss_allocation_box.h
Warnings eliminated:
- "implicit declaration of function 'superslab_allocate'"
- "type of 'superslab_allocate' does not match original declaration"
- "code may be misoptimized unless '-fno-strict-aliasing' is used"
Test results:
- larson_hakmem now runs without segfault ✓
- Multiple test runs confirmed stable ✓
- 2 threads, 4 threads: All passing ✓
Impact:
- CRITICAL severity bug (affects all SuperSlab expansion)
- Intermittent (depends on memory layout ~50% probability)
- Now FIXED completely
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Second freelist path identified by Task exploration agent:
- tiny_drain_freelist_to_sll_once() in hakmem_tiny_free.inc
- Activated via HAKMEM_TINY_DRAIN_TO_SLL environment variable
- Pops blocks from freelist without restoring headers
- Missing header restoration before tls_sll_push() call
Fix applied:
1. Added HEADER_MAGIC restoration before tls_sll_push()
in tiny_drain_freelist_to_sll_once() (lines 74-79)
2. Added tiny_region_id.h include for HEADER_MAGIC definition
This completes the header restoration fixes for all known
freelist → TLS SLL code paths:
1. box_carve_and_push_with_freelist() ✓ (commit 3c6c76cb1)
2. tiny_drain_freelist_to_sll_once() ✓ (this commit)
Expected result:
- Eliminates remaining 4-thread header corruption error
- All freelist blocks now have valid headers before TLS SLL push
Note: Encountered segfault in larson_hakmem during testing,
but this appears to be a pre-existing issue unrelated to
header restoration fixes (verified by testing without changes).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
1. Archive unused backend files (ss_legacy/unified_backend_box.c/h)
- These files were not linked in the build
- Moved to archive/ to reduce confusion
2. Created HAK_RET_ALLOC_BLOCK macro for SuperSlab allocations
- Replaces superslab_return_block() function
- Consistent with existing HAK_RET_ALLOC pattern
- Single source of truth for header writing
- Defined in hakmem_tiny_superslab_internal.h
3. Added header validation on TLS SLL push
- Detects blocks pushed without proper header
- Enabled via HAKMEM_TINY_SLL_VALIDATE_HDR=1 (release)
- Always on in debug builds
- Logs first 10 violations with backtraces
Benefits:
- Easier to track allocation paths
- Catches header bugs at push time
- More maintainable macro-based design
Note: Larson bug still reproduces - header corruption occurs
before push validation can catch it.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Major refactoring to improve maintainability and debugging:
1. Split hakmem_tiny_superslab.c (1521 lines) into 7 focused files:
- superslab_allocate.c: SuperSlab allocation/deallocation
- superslab_backend.c: Backend allocation paths (legacy, shared)
- superslab_ace.c: ACE (Adaptive Cache Engine) logic
- superslab_slab.c: Slab initialization and bitmap management
- superslab_cache.c: LRU cache and prewarm cache management
- superslab_head.c: SuperSlabHead management and expansion
- superslab_stats.c: Statistics tracking and debugging
2. Created hakmem_tiny_superslab_internal.h for shared declarations
3. Added superslab_return_block() as single exit point for header writing:
- All backend allocations now go through this helper
- Prevents bugs where headers are forgotten in some paths
- Makes future debugging easier
4. Updated Makefile for new file structure
5. Added header writing to ss_legacy_backend_box.c and
ss_unified_backend_box.c (though not currently linked)
Note: Header corruption bug in Larson benchmark still exists.
Class 1-6 allocations go through TLS refill/carve paths, not backend.
Further investigation needed.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Changes:
- smallmid_is_in_range(): Add __attribute__((always_inline))
- mid_is_in_range(): Add __attribute__((always_inline))
Expected: Reduce function call overhead in Front Gate routing
Result: Neutral performance (~72M ops/s, same as Phase 1 final)
Analysis:
- Compiler was already inlining these simple functions with -O3 -flto
- 36M branches identified by perf are NOT from Front Gate routing
- Most branches are inside allocators (tiny_alloc, free, etc.)
- Front Gate optimization had minimal impact, as predicted
Next: SuperSlab size optimization (clear 3-5% benefit expected)
Files:
- core/hakmem_smallmid.h:116-119
- core/hakmem_mid_mt.h:228-231
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Problem: Large amount of debug logs in release builds causing performance
degradation in benchmarks (ChatGPT reported 0.73M ops/s vs expected 70M+).
Solution: Guard Ultra SLIM gate debug log with #if !HAKMEM_BUILD_RELEASE.
This log was printing once per thread, acceptable in debug but should be
silent in production.
Performance impact: Logs now suppressed in release builds, reducing I/O
overhead during benchmarks.
Refactored the extremely compressed line 312 (previously 600+ chars) into
properly indented, readable code while preserving identical logic:
- Broke down TLS local freelist spill operation into clear steps
- Added clarifying comment for spill operation
- Improved atomic CAS loop formatting
- No functional changes, only formatting improvements
Performance verified: 16-18M ops/s maintained (same as before)
Replace hardcoded values with named constants for better maintainability:
- ELO_MAX_CPU_NS = 100000.0 (100 microseconds)
- ELO_MAX_PAGE_FAULTS = 1000.0
- ELO_MAX_BYTES_LIVE = 100000000.0 (100 MB)
These constants define the normalization range for ELO score computation.
Moving them to file scope makes them easier to tune and document.
Performance: No change (70.1M ops/s average)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Remove obsolete comment line referencing deleted Phase E5 code.
The actual code was already removed in 2025-11-27 cleanup.
Performance: No change (69.7M ops/s)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add unified stats/dump control that allows enabling specific stats
modules using comma-separated values or "all" to enable everything.
New file: core/hakmem_stats_master.h
- HAKMEM_STATS=all: Enable all stats modules
- HAKMEM_STATS=sfc,fast,pool: Enable specific modules
- HAKMEM_STATS_DUMP=1: Dump stats at exit
- hak_stats_check(): Check if module should enable stats
Available stats modules:
sfc, fast, heap, refill, counters, ring, invariant,
pagefault, front, pool, slim, guard, nearempty
Updated files:
- core/hakmem_tiny_sfc.c: Use hak_stats_check() for SFC stats
- core/hakmem_shared_pool.c: Use hak_stats_check() for pool stats
Performance: No regression (72.9M ops/s)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add unified trace control that allows enabling specific trace modules
using comma-separated values or "all" to enable everything.
New file: core/hakmem_trace_master.h
- HAKMEM_TRACE=all: Enable all trace modules
- HAKMEM_TRACE=ptr,refill,free,mailbox: Enable specific modules
- HAKMEM_TRACE_LEVEL=N: Set trace verbosity (1-3)
- hak_trace_check(): Check if module should enable tracing
Available trace modules:
ptr, refill, superslab, ring, free, mailbox, registry
Priority order:
1. HAKMEM_QUIET=1 → suppress all
2. Specific module ENV (e.g., HAKMEM_PTR_TRACE=1)
3. HAKMEM_TRACE=module1,module2
4. Default → disabled
Updated files:
- core/tiny_refill.h: Use hak_trace_check() for refill tracing
- core/box/mailbox_box.c: Use hak_trace_check() for mailbox tracing
Performance: No regression (72.9M ops/s)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add centralized debug control system that allows enabling all debug
modules at once, while maintaining backwards compatibility with
individual module ENVs.
New file: core/hakmem_debug_master.h
- HAKMEM_DEBUG_ALL=1: Enable all debug modules
- HAKMEM_DEBUG_LEVEL=N: Set debug level (0=off, 1=critical, 2=normal, 3=verbose)
- HAKMEM_QUIET=1: Suppress all debug (highest priority)
- hak_debug_check(): Check if module should enable debug
- hak_is_quiet(): Quick check for quiet mode
Priority order:
1. HAKMEM_QUIET=1 → suppress all
2. Specific module ENV (e.g., HAKMEM_SFC_DEBUG=1)
3. HAKMEM_DEBUG_ALL=1
4. HAKMEM_DEBUG_LEVEL >= threshold
Updated files:
- core/hakmem_elo.c: Use hak_is_quiet() instead of local implementation
- core/hakmem_shared_pool.c: Use hak_debug_check() for lock stats
Performance: No regression (71.5M ops/s maintained)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Critical hot path fix: hakmem_elo.c was calling getenv("HAKMEM_QUIET")
10+ times inside loops, causing 50-100μs overhead per iteration.
Fix: Cache the flag in a static variable with lazy initialization.
- Added is_quiet() helper function with __builtin_expect optimization
- Replaced all 10 inline getenv() calls with is_quiet()
- First call initializes, subsequent calls are just a branch
This is part of the ENV variable cleanup effort identified by the survey:
- Total ENV variables: 228 (target: ~80)
- getenv() calls in hot paths: CRITICAL issue
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Problem: Class 0 (8B stride) was using offset 1 for next pointer storage,
but 8B stride cannot fit [1B header][8B next pointer] - it overflows by 1 byte
into the adjacent block.
Fix: Use offset 0 for C0 (same as C7), allowing the header to be overwritten.
This is safe because:
1. class_map provides out-of-band class_idx lookup (header not needed for free)
2. P3 skips header write by default (header byte is unused anyway)
Optimization: Replace branching with bitmask lookup for zero-cost abstraction.
- Old: (class_idx == 0 || class_idx == 7) ? 0u : 1u (branch)
- New: (0x7Eu >> class_idx) & 1u (branchless)
Bit pattern: C0=0, C1-C6=1, C7=0 → 0b01111110 = 0x7E
Performance results:
- 8B: 85.19M → 85.61M (+0.5%)
- 16B: 137.43M → 147.31M (+7.2%)
- 64B: 84.21M → 84.90M (+0.8%)
Thanks to ChatGPT for spotting the g_tiny_class_sizes vs tiny_nextptr.h mismatch!
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>