Major breakthrough: sh8bench now completes without SIGSEGV!
Added defensive refcounting and failsafe mechanisms to prevent
use-after-free and corruption propagation.
Changes:
1. SuperSlab Refcount Pinning (core/box/tls_sll_box.h)
- tls_sll_push_impl: increment refcount before adding to list
- tls_sll_pop_impl: decrement refcount when removing from list
- Prevents SuperSlab from being freed while TLS SLL holds pointers
2. SuperSlab Release Guards (core/superslab_allocate.c, shared_pool_release.c)
- Check refcount > 0 before freeing SuperSlab
- If refcount > 0, defer release instead of freeing
- Prevents use-after-free when TLS/remote/freelist hold stale pointers
3. TLS SLL Next Pointer Validation (core/box/tls_sll_box.h)
- Detect invalid next pointer during traversal
- Log [TLS_SLL_NEXT_INVALID] when detected
- Drop list to prevent corruption propagation
4. Unified Cache Freelist Validation (core/front/tiny_unified_cache.c)
- Validate freelist head before use
- Log [UNIFIED_FREELIST_INVALID] for corrupted lists
- Defensive drop to prevent bad allocations
5. Early Refcount Decrement Fix (core/tiny_free_fast.inc.h)
- Removed ss_active_dec_one from fast path
- Prevents premature refcount depletion
- Defers decrement to proper cleanup path
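The pinning and release-guard logic from changes 1 and 2 can be sketched roughly as below. This is a minimal illustration, not the code in tls_sll_box.h or superslab_allocate.c: the struct layouts, the `owner` back-pointer, the `release_deferred` flag, and the function names `sll_push`, `sll_pop`, and `superslab_try_release` are all assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the real SuperSlab and TLS SLL layouts differ. */
typedef struct SuperSlab {
    atomic_uint refcount;          /* pins held by TLS SLL entries */
    bool        release_deferred;  /* set when a release had to be postponed */
} SuperSlab;

typedef struct Node {
    struct Node *next;
    SuperSlab   *owner;            /* assumed back-pointer to the owning SuperSlab */
} Node;

typedef struct {
    Node    *head;
    unsigned count;
} TlsSll;

/* Change 1: pin the owner before the node becomes reachable from the list. */
static void sll_push(TlsSll *l, Node *n) {
    assert(n != NULL && n->owner != NULL);
    atomic_fetch_add_explicit(&n->owner->refcount, 1, memory_order_relaxed);
    n->next = l->head;
    l->head = n;
    l->count++;
}

/* Change 1: unlink first, then drop the pin taken in sll_push(). */
static Node *sll_pop(TlsSll *l) {
    Node *n = l->head;
    if (n == NULL) return NULL;
    l->head = n->next;
    l->count--;
    atomic_fetch_sub_explicit(&n->owner->refcount, 1, memory_order_release);
    return n;
}

/* Change 2: free only when nothing is pinned; otherwise defer.
 * Returns true if the SuperSlab was actually freed. */
static bool superslab_try_release(SuperSlab *ss) {
    if (atomic_load_explicit(&ss->refcount, memory_order_acquire) > 0) {
        ss->release_deferred = true;
        return false;
    }
    free(ss);
    return true;
}
```

Note that the check-then-free in this sketch is only safe if no other thread can take a new pin between the load and the free; a real guard has to close that window, which the sketch does not attempt.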
Test Results:
✅ sh8bench completes successfully (exit code 0)
✅ No SIGSEGV or ABORT signals
✅ Short runs (5s) crash-free
⚠️ Multiple [TLS_SLL_NEXT_INVALID] / [UNIFIED_FREELIST_INVALID] logged
⚠️ Invalid pointers still present (stale references exist)
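The [TLS_SLL_NEXT_INVALID] entries come from the next-pointer check in change 3. A rough sketch of that kind of check follows, assuming a hypothetical plausibility test (the pointer must lie inside a known node region and sit on a node boundary). The real validation in tls_sll_box.h may use different criteria, and the log format after the tag is invented here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct Node { struct Node *next; } Node;
typedef struct { Node *head; unsigned count; } TlsSll;

/* Hypothetical test: next must fall in [lo, hi) on a node boundary. */
static bool next_plausible(const Node *next, const Node *lo, const Node *hi) {
    uintptr_t v = (uintptr_t)next, a = (uintptr_t)lo, b = (uintptr_t)hi;
    return v >= a && v < b && (v - a) % sizeof(Node) == 0;
}

/* Pop with validation. On a bad next pointer the whole list is dropped
 * (the blocks are leaked on purpose) so the corruption cannot spread. */
static Node *sll_pop_checked(TlsSll *l, const Node *lo, const Node *hi) {
    Node *n = l->head;
    if (n == NULL) return NULL;
    Node *next = n->next;
    if (next != NULL && !next_plausible(next, lo, hi)) {
        fprintf(stderr, "[TLS_SLL_NEXT_INVALID] next=%p\n", (void *)next);
        l->head = NULL;
        l->count = 0;
        return NULL;
    }
    l->head = next;
    l->count--;
    return n;
}
```

Dropping the list trades a bounded leak for safety, which matches the "defensive" status described here: the stale pointer is contained, but its origin is still unknown.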
Status Analysis:
- Stability: ACHIEVED for short runs (no crashes in 5s runs)
- Root Cause: NOT FULLY SOLVED (invalid pointers remain)
- Approach: Defensive + refcount guards working well
Remaining Issues:
❌ Why does SuperSlab get unregistered while TLS SLL holds pointers?
❌ SuperSlab lifecycle: remote_queue / adopt / LRU interactions?
❌ Stale pointers indicate improper SuperSlab lifetime management
Performance Impact:
- Refcount operations: +1-3 cycles per push/pop (minor)
- Validation checks: +2-5 cycles (minor)
- Overall: < 5% overhead estimated
Next Investigation:
- Trace SuperSlab lifecycle (allocation → registration → unregister → free)
- Check remote_queue handling
- Verify adopt/LRU mechanisms
- Correlate stale pointer logs with SuperSlab unregister events
Log Volume Warning:
- May produce many diagnostic logs on long runs
- Consider ENV gating for production
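One possible shape for the ENV gating is sketched below. The variable name `HAKMEM_DIAG_LOG`, the helper names, and the `DIAG_LOG` macro are hypothetical and do not exist in the tree; the sketch only shows reading the variable once and caching the result so the hot path pays for a single relaxed load.

```c
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Treat unset, empty, or "0" as disabled; anything else as enabled. */
static int diag_flag_from_string(const char *v) {
    return v != NULL && v[0] != '\0' && strcmp(v, "0") != 0;
}

/* Read the (hypothetical) HAKMEM_DIAG_LOG variable once and cache it.
 * Two threads may both call getenv() on first use; both store the same value. */
static int diag_enabled(void) {
    static atomic_int cached = -1;
    int c = atomic_load_explicit(&cached, memory_order_relaxed);
    if (c < 0) {
        c = diag_flag_from_string(getenv("HAKMEM_DIAG_LOG"));
        atomic_store_explicit(&cached, c, memory_order_relaxed);
    }
    return c;
}

/* Diagnostics become a cheap branch when logging is disabled. */
#define DIAG_LOG(...) \
    do { if (diag_enabled()) fprintf(stderr, __VA_ARGS__); } while (0)
```

With this in place, a call such as `DIAG_LOG("[TLS_SLL_NEXT_INVALID] ...\n")` stays silent unless the variable is set in the environment.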
Technical Notes:
- Refcount is per-SuperSlab, not global
- Guards prevent symptom propagation, not root cause
- Root cause is in SuperSlab lifecycle management
🤖 Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Diagnostic changes from ChatGPT to address the TLS_SLL_HDR_RESET issue.
Current status: partial mitigation; the root cause remains unidentified.
Changes Applied:
1. SuperSlab Registry Fallback (hakmem_super_registry.h)
- Added legacy table probe when hash map lookup misses
- Prevents NULL returns for valid SuperSlabs during initialization
- Status: ✅ Works but may hide underlying registration issues
2. TLS SLL Push Validation (tls_sll_box.h)
- Reject push if SuperSlab lookup returns NULL
- Reject push if class_idx mismatch detected
- Added [TLS_SLL_PUSH_NO_SS] diagnostic message
- Status: ✅ Prevents list corruption (defensive)
3. SuperSlab Allocation Class Fix (superslab_allocate.c)
- Pass actual class_idx to sp_internal_allocate_superslab
- Prevents dummy class=8 causing OOB access
- Status: ✅ Root cause fix for allocation path
4. Debug Output Additions
- First 256 push/pop operations traced
- First 4 mismatches logged with details
- SuperSlab registration state logged
- Status: ✅ Diagnostic tool (not a fix)
5. TLS Hint Box Removed
- Deleted ss_tls_hint_box.{c,h} (Phase 1 optimization)
- Simplified to focus on stability first
- Status: ⏳ Can be re-added after root cause fixed
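Changes 1 and 2 interact: the push path depends on the registry lookup, and the fallback probe decides whether a push is rejected. The sketch below shows that flow under stated assumptions. The table sizes, the 2 MiB alignment (`SS_SHIFT`), the struct layouts, and the names `registry_lookup` and `sll_push_validated` are illustrative only; the real lookup is hak_super_lookup in hakmem_super_registry.h and is not reproduced here. Only the [TLS_SLL_PUSH_NO_SS] tag comes from the change itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SS_SHIFT     21   /* assumption: SuperSlabs are 2 MiB aligned */
#define HASH_SLOTS   64   /* hypothetical table sizes */
#define LEGACY_SLOTS 16

typedef struct SuperSlab { int class_idx; } SuperSlab;
typedef struct { uintptr_t base; SuperSlab *ss; } Entry;
typedef struct {
    Entry hash[HASH_SLOTS];      /* primary: open-addressed hash map */
    Entry legacy[LEGACY_SLOTS];  /* fallback: small linear table */
} Registry;

typedef struct Node { struct Node *next; } Node;
typedef struct { Node *head; unsigned count; } TlsSll;

/* Hash map first; on a miss, probe the legacy table before giving up. */
static SuperSlab *registry_lookup(const Registry *r, const void *p) {
    uintptr_t base = (uintptr_t)p & ~(((uintptr_t)1 << SS_SHIFT) - 1);
    if (base == 0) return NULL;
    size_t h = (size_t)((base >> SS_SHIFT) % HASH_SLOTS);
    for (size_t i = 0; i < HASH_SLOTS; i++) {
        const Entry *e = &r->hash[(h + i) % HASH_SLOTS];
        if (e->base == 0) break;             /* empty slot ends the probe */
        if (e->base == base) return e->ss;
    }
    for (size_t i = 0; i < LEGACY_SLOTS; i++) {
        if (r->legacy[i].base == base) return r->legacy[i].ss;
    }
    return NULL;
}

/* Reject the push when the pointer has no SuperSlab or the class differs. */
static bool sll_push_validated(TlsSll *l, const Registry *r, Node *n, int class_idx) {
    SuperSlab *ss = registry_lookup(r, n);
    if (ss == NULL) {
        fprintf(stderr, "[TLS_SLL_PUSH_NO_SS] ptr=%p cls=%d\n", (void *)n, class_idx);
        return false;
    }
    if (ss->class_idx != class_idx) return false;  /* class_idx mismatch */
    n->next = l->head;
    l->head = n;
    l->count++;
    return true;
}
```

The sketch also shows why the fallback can mask bugs: a SuperSlab that was never inserted into the hash map still resolves through the legacy table, so the missing registration goes unnoticed.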
Current Problem (REMAINS UNSOLVED):
- [TLS_SLL_HDR_RESET] still occurs after ~60 seconds of sh8bench
- Pointer is offset by 16 bytes from the expected address (class 1 → class 2 boundary)
- hak_super_lookup returns NULL for that pointer
- Suggests a use-after-free, a double-free, or a pointer arithmetic error
Root Cause Analysis:
- Pattern: Pointer offset by +16 (one class 1 stride)
- Timing: Cumulative problem (appears after ~60s, not immediately)
- Location: Header corruption detected during TLS SLL pop
Remaining Issues:
⚠️ Registry fallback is defensive (may hide registration bugs)
⚠️ Push validation prevents symptoms but not root cause
⚠️ 16-byte pointer offset source unidentified
Next Steps for Investigation:
1. Full pointer arithmetic audit (Magazine ⇔ TLS SLL paths)
2. Enhanced logging at HDR_RESET point:
- Expected vs actual pointer value
- Pointer provenance (where it came from)
- Allocation trace for that block
3. Verify Headerless flag is OFF throughout build
4. Check for double-offset application in conversions
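For step 4, one way to make the double-offset audit mechanical is to wrap base and user pointers in distinct struct types, so that applying a conversion twice fails to compile. The sketch below is an assumption-laden illustration: `BasePtr`, `UserPtr`, the conversion helpers, and the 1-byte `HEADER_BYTES` are hypothetical and are not taken from the hakmem sources.

```c
#include <stddef.h>

#define HEADER_BYTES 1  /* hypothetical header size; check the real build */

/* Distinct wrapper types: a BasePtr cannot be passed where a UserPtr is
 * expected, so each conversion can only be applied in one direction. */
typedef struct { unsigned char *p; } BasePtr;  /* start of block, header included */
typedef struct { unsigned char *p; } UserPtr;  /* address handed to the caller */

static UserPtr base_to_user(BasePtr b) {
    UserPtr u = { b.p + HEADER_BYTES };
    return u;
}

static BasePtr user_to_base(UserPtr u) {
    BasePtr b = { u.p - HEADER_BYTES };
    return b;
}
```

With these types, `base_to_user(base_to_user(b))` is rejected by the compiler because the inner call returns a `UserPtr`. Converting the Magazine and TLS SLL paths to such wrappers, even temporarily, would point at any site that applies an offset twice.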
Technical Assessment:
- Roughly 60% of the changes are root cause fixes (allocation class, validation)
- Roughly 40% are defensive mitigation (registry fallback, push rejection)
Performance Impact:
- Registry fallback: +10-30 cycles on cold path (negligible)
- Push validation: +5-10 cycles per push (acceptable)
- Overall: < 2% performance impact estimated
Related Issues:
- Phase 1 TLS Hint Box removed temporarily
- Phase 2 Headerless blocked until stability achieved
🤖 Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Major refactoring to improve maintainability and debugging:
1. Split hakmem_tiny_superslab.c (1521 lines) into 7 focused files:
- superslab_allocate.c: SuperSlab allocation/deallocation
- superslab_backend.c: Backend allocation paths (legacy, shared)
- superslab_ace.c: ACE (Adaptive Cache Engine) logic
- superslab_slab.c: Slab initialization and bitmap management
- superslab_cache.c: LRU cache and prewarm cache management
- superslab_head.c: SuperSlabHead management and expansion
- superslab_stats.c: Statistics tracking and debugging
2. Created hakmem_tiny_superslab_internal.h for shared declarations
3. Added superslab_return_block() as single exit point for header writing:
- All backend allocations now go through this helper
- Prevents bugs where headers are forgotten in some paths
- Makes future debugging easier
4. Updated Makefile for new file structure
5. Added header writing to ss_legacy_backend_box.c and
ss_unified_backend_box.c (these files are not currently linked into the build)
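The single-exit-point idea from item 3 can be illustrated as follows. This is a sketch only: the real signature of superslab_return_block() is not shown in this message, so the helper here is named `sketch_return_block`, and the header layout (`HDR_MAGIC`, a 1-byte header carrying the class index) and the two backend functions are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define HDR_MAGIC 0xA0u  /* hypothetical tag in the high nibble */
#define HDR_BYTES 1      /* hypothetical header size */

/* Single exit point: every backend path returns through here, so the
 * header is written in exactly one place. */
static void *sketch_return_block(void *base, int class_idx) {
    if (base == NULL) return NULL;
    uint8_t *p = (uint8_t *)base;
    p[0] = (uint8_t)(HDR_MAGIC | ((unsigned)class_idx & 0x0Fu));
    return p + HDR_BYTES;
}

/* Two illustrative backends; neither writes the header itself. */
static void *legacy_backend_alloc(void *raw_block, int class_idx) {
    return sketch_return_block(raw_block, class_idx);
}

static void *shared_backend_alloc(void *raw_block, int class_idx) {
    return sketch_return_block(raw_block, class_idx);
}
```

A path that bypasses the helper, such as a refill or carve path that hands out blocks directly, gets no header from it, so funnelling every exit through one function only helps for the paths that actually use it.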
Note: The header corruption bug in the Larson benchmark still exists.
Class 1-6 allocations go through the TLS refill/carve paths, not the backend,
so superslab_return_block() does not cover them. Further investigation needed.
🤖 Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>