Add SuperSlab refcount pinning and critical failsafe guards

Major breakthrough: sh8bench now completes without SIGSEGV!
Added defensive refcounting and failsafe mechanisms to prevent
use-after-free and corruption propagation.

Changes:
1. SuperSlab Refcount Pinning (core/box/tls_sll_box.h)
   - tls_sll_push_impl: increment refcount before adding to list
   - tls_sll_pop_impl: decrement refcount when removing from list
   - Prevents SuperSlab from being freed while TLS SLL holds pointers

2. SuperSlab Release Guards (core/superslab_allocate.c, shared_pool_release.c)
   - Check refcount > 0 before freeing SuperSlab
   - If refcount > 0, defer release instead of freeing
   - Prevents use-after-free when TLS/remote/freelist hold stale pointers
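
As a rough illustration of items 1 and 2, the sketch below pins a slab on push, unpins it on pop, and refuses to free while any pin remains. Every name here (DemoSuperSlab, demo_ss_pin, demo_ss_unpin, demo_ss_try_release) is hypothetical and is not taken from tls_sll_box.h or superslab_allocate.c; the real structures and memory ordering may differ.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the real SuperSlab header (assumption). */
typedef struct DemoSuperSlab {
    atomic_uint refcount; /* number of TLS SLL entries pointing into this slab */
    bool freed;           /* demo only: records that the slab was released */
} DemoSuperSlab;

static inline void demo_ss_init(DemoSuperSlab *ss) {
    atomic_init(&ss->refcount, 0u);
    ss->freed = false;
}

/* Called on push: pin the slab before its block is linked into the list. */
static inline void demo_ss_pin(DemoSuperSlab *ss) {
    atomic_fetch_add_explicit(&ss->refcount, 1u, memory_order_relaxed);
}

/* Called on pop: drop the pin once the block has left the list. */
static inline void demo_ss_unpin(DemoSuperSlab *ss) {
    atomic_fetch_sub_explicit(&ss->refcount, 1u, memory_order_release);
}

/* Release guard: free only when nothing pins the slab.
 * Returns true if released, false if the release was deferred. */
static inline bool demo_ss_try_release(DemoSuperSlab *ss) {
    if (atomic_load_explicit(&ss->refcount, memory_order_acquire) > 0u) {
        return false; /* still pinned: defer instead of freeing */
    }
    ss->freed = true; /* real code would unregister and unmap here */
    return true;
}
```

Note that a check followed by a free is only safe if no other thread can pin the slab in between; the sketch does not address that window.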

3. TLS SLL Next Pointer Validation (core/box/tls_sll_box.h)
   - Detect invalid next pointer during traversal
   - Log [TLS_SLL_NEXT_INVALID] when detected
   - Drop list to prevent corruption propagation
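
A minimal sketch of the check in item 3, assuming the same address-range heuristic that appears in the fast-pop macro of this commit (reject values below 4096 or above 0x00007fffffffffff). DemoNode, DemoList, demo_ptr_plausible and demo_pop are hypothetical names, and resetting the count to zero on drop is a simplification of this sketch rather than the committed behavior.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical intrusive list types (assumption, not the real TLS SLL). */
typedef struct DemoNode { struct DemoNode *next; } DemoNode;
typedef struct DemoList { DemoNode *head; uint32_t count; } DemoList;

/* Range heuristic: below the first page or above the user-space
 * canonical range is treated as corrupt. */
static inline bool demo_ptr_plausible(const void *p) {
    uintptr_t v = (uintptr_t)p;
    return v >= 4096u && v <= (uintptr_t)0x00007fffffffffffULL;
}

/* Pop one node; if the stored next pointer looks invalid, log it and
 * drop the whole list so the bad pointer is never followed. */
static inline void *demo_pop(DemoList *l) {
    DemoNode *head = l->head;
    if (head == NULL) return NULL;
    DemoNode *next = head->next;
    if (next != NULL && !demo_ptr_plausible(next)) {
        fprintf(stderr, "[TLS_SLL_NEXT_INVALID] head=%p next=%p\n",
                (void *)head, (void *)next);
        l->head = NULL; /* drop the list */
        l->count = 0;   /* simplification of this sketch */
        return NULL;
    }
    l->head = next;
    if (l->count > 0) l->count--;
    return head;
}
```

A range check like this only catches grossly invalid values; a stale pointer into unmapped but in-range memory still passes.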

4. Unified Cache Freelist Validation (core/front/tiny_unified_cache.c)
   - Validate freelist head before use
   - Log [UNIFIED_FREELIST_INVALID] for corrupted lists
   - Defensive drop to prevent bad allocations

5. Early Refcount Decrement Fix (core/tiny_free_fast.inc.h)
   - Removed ss_active_dec_one from fast path
   - Prevents premature refcount depletion
   - Defers decrement to proper cleanup path

Test Results:
- sh8bench completes successfully (exit code 0)
- No SIGSEGV or ABORT signals
- Short runs (5s) crash-free
⚠️ Multiple [TLS_SLL_NEXT_INVALID] / [UNIFIED_FREELIST_INVALID] logged
⚠️ Invalid pointers still present (stale references exist)

Status Analysis:
- Stability: ACHIEVED for short runs (no crashes in 5s sh8bench runs)
- Root Cause: NOT FULLY SOLVED (invalid pointers remain)
- Approach: Defensive + refcount guards working well

Remaining Issues:
- Why does SuperSlab get unregistered while TLS SLL holds pointers?
- SuperSlab lifecycle: remote_queue / adopt / LRU interactions?
- Stale pointers indicate improper SuperSlab lifetime management

Performance Impact:
- Refcount operations: +1-3 cycles per push/pop (estimated, minor)
- Validation checks: +2-5 cycles (estimated, minor)
- Overall: < 5% overhead estimated

Next Investigation:
- Trace SuperSlab lifecycle (allocate → register → unregister → free)
- Check remote_queue handling
- Verify adopt/LRU mechanisms
- Correlate stale pointer logs with SuperSlab unregister events

Log Volume Warning:
- May produce many diagnostic logs on long runs
- Consider ENV gating for production
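
One possible shape for the ENV gating suggested above, as a sketch only. The variable name DEMO_ALLOC_DIAG and the helpers demo_diag_parse, demo_diag_enabled and demo_diag_log are hypothetical; the commit does not define them.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Interpret an environment value: unset, empty, or starting with '0' means off. */
static inline int demo_diag_parse(const char *v) {
    return (v != NULL && v[0] != '\0' && v[0] != '0') ? 1 : 0;
}

/* Look the variable up once per thread and cache the answer, so the
 * hot path pays for a single integer compare afterwards. */
static inline int demo_diag_enabled(void) {
    static _Thread_local int cached = -1;
    if (cached < 0) {
        cached = demo_diag_parse(getenv("DEMO_ALLOC_DIAG")); /* hypothetical name */
    }
    return cached;
}

/* printf-style logger that is silent unless diagnostics are enabled.
 * Returns 1 if the message was written, 0 if it was suppressed. */
static inline int demo_diag_log(const char *fmt, ...) {
    if (!demo_diag_enabled()) return 0;
    va_list ap;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    return 1;
}
```

With this shape, a call such as demo_diag_log("[TLS_SLL_NEXT_INVALID] cls=%d\n", cls) would stay silent in production unless the variable is set to a non-zero value.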

Technical Notes:
- Refcount is per-SuperSlab, not global
- Guards contain the symptoms; they do not fix the root cause
- Root cause is in SuperSlab lifecycle management

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Moe Charm (CI)
2025-12-03 21:56:52 +09:00
parent cd6177d1de
commit 19ce4c1ac4
8 changed files with 236 additions and 28 deletions

@@ -56,7 +56,7 @@ extern __thread const char* g_tls_sll_last_writer[TINY_NUM_CLASSES];
 if (__builtin_expect(_head != NULL, 1)) { \
 if (__builtin_expect((uintptr_t)_head == TINY_REMOTE_SENTINEL, 0)) { \
 /* Break the chain defensively if sentinel leaked into TLS SLL */ \
-g_tls_sll[(class_idx)].head = NULL; \
+tls_sll_set_head_raw((class_idx), NULL, "fast_pop_sentinel"); \
 g_tls_sll_last_writer[(class_idx)] = "fast_pop_sentinel"; \
 if (g_tls_sll[(class_idx)].count > 0) g_tls_sll[(class_idx)].count--; \
 (ptr_out) = NULL; \
@@ -66,15 +66,14 @@ extern __thread const char* g_tls_sll_last_writer[TINY_NUM_CLASSES];
 if (__builtin_expect(class_idx == 4 || class_idx == 6, 0)) { \
 tls_sll_diag_next(class_idx, _head, _next, "fast_pop_next"); \
 } \
-g_tls_sll[(class_idx)].head = _next; \
-g_tls_sll_last_writer[(class_idx)] = "fast_pop"; \
+tls_sll_set_head_raw((class_idx), _next, "fast_pop"); \
 if ((class_idx == 4 || class_idx == 6) && _next && ((uintptr_t)_next < 4096 || (uintptr_t)_next > 0x00007fffffffffffULL)) { \
 static __thread uint8_t s_fast_pop_invalid_log[8] = {0}; \
 if (s_fast_pop_invalid_log[(class_idx)] < 4) { \
 fprintf(stderr, "[TLS_SLL_FAST_POP_INVALID] cls=%d head=%p next=%p\n", (class_idx), _head, _next); \
 s_fast_pop_invalid_log[(class_idx)]++; \
 } \
-g_tls_sll[(class_idx)].head = NULL; \
+tls_sll_set_head_raw((class_idx), NULL, "fast_pop_post_invalid"); \
 /* keep count unchanged to flag drop */ \
 g_tls_sll_last_writer[(class_idx)] = "fast_pop_post_invalid"; \
 (ptr_out) = NULL; \
@@ -126,15 +125,13 @@ extern __thread const char* g_tls_sll_last_writer[TINY_NUM_CLASSES];
 } \
 /* Link node using BASE as the canonical SLL node address. */ \
 tiny_next_write((class_idx), _base, g_tls_sll[(class_idx)].head); \
-g_tls_sll[(class_idx)].head = _base; \
-g_tls_sll_last_writer[(class_idx)] = "fast_push"; \
+tls_sll_set_head_raw((class_idx), _base, "fast_push"); \
 g_tls_sll[(class_idx)].count++; \
 } while(0)
 #else
 #define TINY_ALLOC_FAST_PUSH_INLINE(class_idx, ptr) do { \
 tiny_next_write(class_idx, (ptr), g_tls_sll[(class_idx)].head); \
-g_tls_sll[(class_idx)].head = (ptr); \
-g_tls_sll_last_writer[(class_idx)] = "fast_push"; \
+tls_sll_set_head_raw((class_idx), (ptr), "fast_push"); \
 g_tls_sll[(class_idx)].count++; \
 } while(0)
 #endif