Add SuperSlab refcount pinning and critical failsafe guards

Major breakthrough: sh8bench now completes without SIGSEGV!
Added defensive refcounting and failsafe mechanisms to prevent
use-after-free and corruption propagation.

Changes:
1. SuperSlab Refcount Pinning (core/box/tls_sll_box.h)
   - tls_sll_push_impl: increment refcount before adding to list
   - tls_sll_pop_impl: decrement refcount when removing from list
   - Prevents SuperSlab from being freed while TLS SLL holds pointers (a pin/release sketch follows after this list)

2. SuperSlab Release Guards (core/superslab_allocate.c, shared_pool_release.c)
   - Check refcount > 0 before freeing SuperSlab
   - If refcount > 0, defer release instead of freeing
   - Prevents use-after-free when TLS/remote/freelist hold stale pointers

3. TLS SLL Next Pointer Validation (core/box/tls_sll_box.h)
   - Detect invalid next pointer during traversal
   - Log [TLS_SLL_NEXT_INVALID] when detected
   - Drop list to prevent corruption propagation (see the traversal sketch after this list)

4. Unified Cache Freelist Validation (core/front/tiny_unified_cache.c)
   - Validate freelist head before use
   - Log [UNIFIED_FREELIST_INVALID] for corrupted lists
   - Defensive drop to prevent bad allocations (the diff excerpt below shows this check)

5. Early Refcount Decrement Fix (core/tiny_free_fast.inc.h)
   - Removed ss_active_dec_one from fast path
   - Prevents premature refcount depletion
   - Defers decrement to proper cleanup path
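
Below is a minimal sketch of the pin/release idea behind items 1, 2 and 5
(not the hakmem code itself: the SuperSlab/TlsSll/Block layouts and the
ss_try_release helper are illustrative assumptions; the real entry points
are tls_sll_push_impl / tls_sll_pop_impl and the shared-pool release path):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Illustrative stand-ins; the real SuperSlab / TLS SLL layouts differ. */
    typedef struct SuperSlab {
        _Atomic uint32_t refcount;      /* cached pointers still referencing this slab */
        /* ... slab metadata ... */
    } SuperSlab;

    typedef struct Block { struct Block* next; } Block;

    typedef struct {
        Block*     head;                /* TLS singly linked free list (TLS SLL) */
        SuperSlab* ss;                  /* SuperSlab that owns the cached blocks */
    } TlsSll;

    /* Push: pin the SuperSlab before the pointer becomes reachable from the list. */
    static void tls_sll_push(TlsSll* sll, SuperSlab* ss, Block* b) {
        atomic_fetch_add_explicit(&ss->refcount, 1, memory_order_relaxed);
        b->next   = sll->head;
        sll->head = b;
        sll->ss   = ss;
    }

    /* Pop: unpin only after the pointer has left the list. */
    static Block* tls_sll_pop(TlsSll* sll) {
        Block* b = sll->head;
        if (!b) return NULL;
        sll->head = b->next;
        atomic_fetch_sub_explicit(&sll->ss->refcount, 1, memory_order_relaxed);
        return b;
    }

    /* Release guard: a pinned SuperSlab is never freed; the caller defers it. */
    static bool ss_try_release(SuperSlab* ss) {
        if (atomic_load_explicit(&ss->refcount, memory_order_acquire) > 0) {
            return false;               /* keep on a deferred-release list instead */
        }
        free(ss);                       /* placeholder for the real release path */
        return true;
    }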

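Item 3's next-pointer guard, sketched under the same caveat (hak_super_lookup
mirrors the registry lookup visible in the diff excerpt below; the log fields
and the drop-the-whole-list policy are illustrative):

    #include <stdio.h>

    typedef struct SuperSlab SuperSlab;
    typedef struct Block { struct Block* next; } Block;

    extern SuperSlab* hak_super_lookup(void* p);  /* registry lookup, as in the diff */

    /* Walk the TLS SLL; on the first next pointer that does not resolve to a
     * registered SuperSlab, log and drop the whole list (leaking nodes is
     * preferred over chasing a corrupted pointer into a SEGV). */
    static void tls_sll_validate(Block** head, int class_idx) {
        for (Block* cur = *head; cur; cur = cur->next) {
            Block* next = cur->next;
            if (next && hak_super_lookup(next) == NULL) {
                fprintf(stderr,
                        "[TLS_SLL_NEXT_INVALID] cls=%d node=%p next=%p (dropping list)\n",
                        class_idx, (void*)cur, (void*)next);
                *head = NULL;           /* defensive drop; allocation falls back to refill */
                return;
            }
        }
    }
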
Test Results:
✅ sh8bench completes successfully (exit code 0)
✅ No SIGSEGV or ABORT signals
✅ Short runs (5s) crash-free
⚠️ Multiple [TLS_SLL_NEXT_INVALID] / [UNIFIED_FREELIST_INVALID] logged
⚠️ Invalid pointers still present (stale references exist)

Status Analysis:
- Stability: ACHIEVED (no crashes)
- Root Cause: NOT FULLY SOLVED (invalid pointers remain)
- Approach: Defensive + refcount guards working well

Remaining Issues:
- Why does SuperSlab get unregistered while TLS SLL holds pointers?
- SuperSlab lifecycle: remote_queue / adopt / LRU interactions?
- Stale pointers indicate improper SuperSlab lifetime management

Performance Impact:
- Refcount operations: +1-3 cycles per push/pop (minor)
- Validation checks: +2-5 cycles (minor)
- Overall: estimated at < 5% overhead

Next Investigation:
- Trace SuperSlab lifecycle (allocation → registration → unregister → free)
- Check remote_queue handling
- Verify adopt/LRU mechanisms
- Correlate stale pointer logs with SuperSlab unregister events

Log Volume Warning:
- May produce many diagnostic logs on long runs
- Consider ENV gating for production
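
One possible shape for that gating (a sketch only: the HAKMEM_DEBUG_GUARDS
variable name and helper are hypothetical, not an existing knob):

    #include <stdlib.h>

    /* Cache the environment check so hot paths pay at most one getenv(). */
    static int guard_logs_enabled(void) {
        static int cached = -1;         /* -1 = not checked yet */
        if (cached < 0) {
            const char* v = getenv("HAKMEM_DEBUG_GUARDS");   /* hypothetical name */
            cached = (v && v[0] && v[0] != '0') ? 1 : 0;
        }
        return cached;
    }

    /* Usage at a log site:
     *   if (guard_logs_enabled())
     *       fprintf(stderr, "[TLS_SLL_NEXT_INVALID] ...\n");
     */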

Technical Notes:
- Refcount is per-SuperSlab, not global
- Guards prevent symptom propagation, not root cause
- Root cause is in SuperSlab lifecycle management

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Moe Charm (CI) committed 2025-12-03 21:56:52 +09:00
parent cd6177d1de
commit 19ce4c1ac4
8 changed files with 236 additions and 28 deletions

core/front/tiny_unified_cache.c

@@ -11,6 +11,7 @@
#include "../hakmem_env_cache.h" // Priority-2: ENV cache (eliminate syscalls)
#include <stdlib.h>
#include <string.h>
#include <stdatomic.h>
// Phase 23-E: Forward declarations
extern __thread TinyTLSSlab g_tls_slabs[TINY_NUM_CLASSES]; // From hakmem_tiny_superslab.c
@@ -337,6 +338,39 @@ void* unified_cache_refill(int class_idx) {
        if (m->freelist) {
            // Freelist pop
            void* p = m->freelist;
            // Validate freelist head before dereferencing
            do {
                SuperSlab* fl_ss = hak_super_lookup(p);
                int fl_cap = fl_ss ? ss_slabs_capacity(fl_ss) : 0;
                int fl_idx = (fl_ss && fl_ss->magic == SUPERSLAB_MAGIC) ? slab_index_for(fl_ss, p) : -1;
                uint8_t fl_cls = (fl_idx >= 0 && fl_idx < fl_cap) ? fl_ss->slabs[fl_idx].class_idx : 0xff;
                if (!fl_ss || fl_ss->magic != SUPERSLAB_MAGIC ||
                    fl_idx != tls->slab_idx || fl_ss != tls->ss ||
                    fl_cls != (uint8_t)class_idx) {
                    static _Atomic uint32_t g_fl_invalid = 0;
                    uint32_t shot = atomic_fetch_add_explicit(&g_fl_invalid, 1, memory_order_relaxed);
                    if (shot < 8) {
                        fprintf(stderr,
                                "[UNIFIED_FREELIST_INVALID] cls=%d p=%p ss=%p slab=%d meta_used=%u tls_ss=%p tls_slab=%d cls_meta=%u\n",
                                class_idx,
                                p,
                                (void*)fl_ss,
                                fl_idx,
                                m->used,
                                (void*)tls->ss,
                                tls->slab_idx,
                                (unsigned)fl_cls);
                    }
                    // Drop invalid freelist to avoid SEGV and force slow refill
                    m->freelist = NULL;
                    p = NULL;
                }
            } while (0);
            if (!p) {
                break;
            }
            void* next_node = tiny_next_read(class_idx, p);
            // ROOT CAUSE FIX: Write header BEFORE exposing block (but AFTER reading next)