ChatGPT's diagnostic changes to address the TLS_SLL_HDR_RESET issue.
Current status: partial mitigation; the root cause remains.
Changes Applied:
1. SuperSlab Registry Fallback (hakmem_super_registry.h)
- Added legacy table probe when hash map lookup misses
- Prevents NULL returns for valid SuperSlabs during initialization
- Status: ✅ Works but may hide underlying registration issues
2. TLS SLL Push Validation (tls_sll_box.h)
- Reject push if SuperSlab lookup returns NULL
- Reject push if class_idx mismatch detected
- Added [TLS_SLL_PUSH_NO_SS] diagnostic message
- Status: ✅ Prevents list corruption (defensive)
3. SuperSlab Allocation Class Fix (superslab_allocate.c)
- Pass actual class_idx to sp_internal_allocate_superslab
- Prevents dummy class=8 causing OOB access
- Status: ✅ Root cause fix for allocation path
4. Debug Output Additions
- First 256 push/pop operations traced
- First 4 mismatches logged with details
- SuperSlab registration state logged
- Status: ✅ Diagnostic tool (not a fix)
5. TLS Hint Box Removed
- Deleted ss_tls_hint_box.{c,h} (Phase 1 optimization)
- Simplified to focus on stability first
- Status: ⏳ Can be re-added after root cause fixed
Current Problem (REMAINS UNSOLVED):
- [TLS_SLL_HDR_RESET] still occurs after ~60 seconds of sh8bench
- Pointer is offset 16 bytes from its expected value (crossing the class 1 → class 2 boundary)
- hak_super_lookup returns NULL for that pointer
- Suggests: Use-After-Free, Double-Free, or pointer arithmetic error
Root Cause Analysis:
- Pattern: Pointer offset by +16 (one class 1 stride)
- Timing: Cumulative problem (appears after 60s, not immediately)
- Location: Header corruption detected during TLS SLL pop
Remaining Issues:
⚠️ Registry fallback is defensive (may hide registration bugs)
⚠️ Push validation prevents symptoms but not root cause
⚠️ 16-byte pointer offset source unidentified
Next Steps for Investigation:
1. Full pointer arithmetic audit (Magazine ⇔ TLS SLL paths)
2. Enhanced logging at HDR_RESET point:
- Expected vs actual pointer value
- Pointer provenance (where it came from)
- Allocation trace for that block
3. Verify Headerless flag is OFF throughout build
4. Check for double-offset application in conversions
Technical Assessment:
- 60% root cause fixes (allocation class, validation)
- 40% defensive mitigation (registry fallback, push rejection)
Performance Impact:
- Registry fallback: +10-30 cycles on cold path (negligible)
- Push validation: +5-10 cycles per push (acceptable)
- Overall: < 2% performance impact estimated
Related Issues:
- Phase 1 TLS Hint Box removed temporarily
- Phase 2 Headerless blocked until stability achieved
🤖 Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Problem: hak_core_init.inc.h references KPI measurement variables
(g_latency_histogram, g_latency_samples, g_baseline_soft_pf, etc.)
but hakmem.c was including hak_kpi_util.inc.h AFTER hak_core_init.inc.h,
causing undefined reference errors.
Solution: Reorder includes so hak_kpi_util.inc.h (definition) comes
before hak_core_init.inc.h (usage).
Build result: ✅ Success (libhakmem.so 547KB, 0 errors)
Minor changes:
- Added extern __thread declarations for TLS SLL debug variables
- Added signal handler logging for debug_dump_last_push
- Improved hakmem_tiny.c structure for Phase 2 preparation
🤖 Generated with Claude Code + Task Agent
Co-Authored-By: Gemini <gemini@example.com>
Co-Authored-By: Claude <noreply@anthropic.com>
This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE). It enables precise identification of the root cause of Out-Of-Memory (OOM) issues related to ACE allocations.
Key changes include:
- **ACE Tracing Implementation**:
  - Added an environment variable to enable/disable detailed logging of allocation failures.
  - Instrumented the ACE allocation paths to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
  - Corrected the build configuration so that the ACE tracing code is properly linked, resolving a link error.
- **LD_PRELOAD Wrapper Adjustments**:
  - Investigated and documented the wrapper's behavior under LD_PRELOAD, particularly its interaction with the guard checks.
  - Enabled debugging flags for the test environment to prevent unintended fallbacks to the system allocator for non-tiny allocations, allowing comprehensive testing of the allocator.
- **Debugging & Verification**:
  - Introduced temporary verbose logging to pinpoint execution-flow issues within interception and routing; these temporary logs have since been removed.
  - Created a test script to facilitate testing of the tracing features.
This feature will significantly aid in diagnosing and resolving allocation-related OOM issues by providing clear insight into the failure pathways.
### Changes:
1. **Removed diagnostic from wrapper** (hakmem_tiny.c:1542)
- Was: getenv() + fprintf() on every wrapper call
- Now: Direct return tiny_alloc_fast(size)
- Relies on LTO (-flto) for inlining
2. **Removed counter overhead from malloc()** (hakmem.c:1242)
- Was: 4 TLS counter increments per malloc
- g_malloc_total_calls++
- g_malloc_tiny_size_match++
- g_malloc_fast_path_tried++
- g_malloc_fast_path_null++ (on miss)
- Now: Zero counter overhead
### Performance Results:
```
Before (with overhead): 1.51M ops/s
After (zero overhead): 1.59M ops/s (+5% 🎉)
Baseline (old impl): 1.68M ops/s (-5% gap remains)
System malloc: 8.08M ops/s (reference)
```
### Analysis:
**What was heavy:**
- Counter increments: ~4 TLS writes per malloc (cache pollution)
- Diagnostic: getenv() + fprintf() check (even if disabled)
- Together these cost roughly 80K ops/s of throughput
**Remaining gap (-5% vs baseline):**
Box Theory (1.59M) vs Old implementation (1.68M)
- Likely due to: ownership check in free path
- Or: refill backend (sll_refill_small_from_ss vs hak_tiny_alloc x16)
### Bottleneck Update:
From profiling data (2,418 cycles per fast path):
```
Fast path time: 49.5M cycles (49.1% of total)
Refill time: 51.3M cycles (50.9% of total)
Counter overhead removed: ~5% improvement
LTO should inline wrapper: Further gains expected
```
### Status:
✅ IMPROVEMENT - Removed overhead, 5% faster
❌ STILL SHORT - 5% slower than baseline (1.68M target)
### Next Steps:
A. Investigate ownership check overhead in free path
B. Compare refill backend efficiency
C. Consider reverting to old implementation if gap persists
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
### Problem Identified:
Previous commit routed malloc() → guards → hak_alloc_at() → Box Theory
This added massive overhead (guard checks, function calls) defeating the
"3-4 instruction" fast path promise.
### Root Cause:
"It's odd that it got slower with fewer instructions" - the user's insight was correct!
Box Theory claims 3-4 instructions, but routing added dozens of instructions
before reaching TLS freelist pop.
### Fix:
Move Box Theory call to malloc() entry point (line ~1253), BEFORE all guards:
```c
#ifdef HAKMEM_TINY_PHASE6_BOX_REFACTOR
if (size <= TINY_FAST_THRESHOLD) {
void* ptr = hak_tiny_alloc_fast_wrapper(size);
if (ptr) return ptr; // ✅ Fast path: No guards, no overhead
}
#endif
// SLOW PATH: All guards here...
```
### Performance Results:
```
Baseline (old tiny_fast_alloc): 1.68M ops/s
Box Theory (no env vars): 1.22M ops/s (-27%)
Box Theory (with env vars): 1.39M ops/s (-17%) ← Improved!
System malloc: 8.08M ops/s
CLAUDE.md expectation: 2.75M (+64%) ~ 4.19M (+150%) ← Not reached
```
### Env Vars Used:
```
HAKMEM_TINY_TLS_SLL=1 HAKMEM_TINY_TLS_LIST=0 HAKMEM_TINY_HOTMAG=0
HAKMEM_WRAP_TINY=1 HAKMEM_TINY_SS_ADOPT=1 HAKMEM_TINY_TRACE_RING=0
HAKMEM_SAFE_FREE=0 HAKMEM_TINY_REFILL_COUNT=128
```
### Verification:
- ✅ HAKMEM_TINY_PHASE6_BOX_REFACTOR=1 confirmed active
- ✅ hak_tiny_alloc_fast_wrapper() called (FRONT diagnostics)
- ✅ Routing now bypasses guards for fast path
- ❌ Still -17% slower than baseline (investigation needed)
### Status:
🔬 PARTIAL SUCCESS - Routing fixed, but performance below expectation.
Box Theory is active and bypassing guards, but still slower than old implementation.
### Next Steps:
- Compare refill implementations (old vs Box Theory)
- Profile to identify specific bottleneck
- Investigate why Box Theory underperforms vs CLAUDE.md claims
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md, CLAUDE.md Phase 6-1.7
### Changes:
- **Removed duplicate fast paths**: Disabled HAKMEM_TINY_FAST_PATH in:
- malloc() entry point (line ~1257)
- hak_alloc_at() helper (line ~682)
- **Unified to Box Theory**: All tiny allocations now use Box Theory's
hak_tiny_alloc_fast_wrapper() at line ~712 (HAKMEM_TINY_PHASE6_BOX_REFACTOR)
### Rationale:
- Previous implementation had **2 fast path checks** (double overhead)
- Box Theory (tiny_alloc_fast.inc.h) provides optimized 3-4 instruction path
- CLAUDE.md claims +64% (debug), +150% (production) with Box Theory
- Attempt to eliminate redundant checks and unify to single fast path
### Performance Results:
⚠️ **REGRESSION** - Performance decreased:
```
Baseline (old tiny_fast_alloc): 1.68M ops/s
Box Theory (unified): 1.35M ops/s (-20%)
System malloc: 8.08M ops/s (reference)
```
### Status:
🔬 **EXPERIMENTAL** - This commit documents the attempt but shows regression.
Possible issues:
1. Box Theory may need additional tuning (env vars not sufficient)
2. Refill backend may be slower than old implementation
3. TLS freelist initialization overhead
4. Missing optimizations in Box Theory integration
### Next Steps:
- Profile to identify why Box Theory is slower
- Compare refill efficiency: old vs Box Theory
- Check if TLS SLL variables are properly initialized
- Consider reverting if root cause not found
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md, CLAUDE.md Phase 6-1.7
Implementation: Move HAKMEM_TINY_FAST_PATH check BEFORE all guard checks
in malloc(), inspired by mimalloc/tcache entry point design.
Strategy:
- tcache has 0 branches before fast path
- mimalloc has 1-2 branches before fast path
- Old HAKMEM had 8+ branches before fast path
- Phase 1: Move fast path to line 1, add branch prediction hints
Changes in core/hakmem.c:
1. Fast Path First: Size check → Init check → Cache hit (3 branches)
2. Slow Path: All guards moved after fast path (rare cases)
3. Branch hints: __builtin_expect() for hot paths
Expected results (from research):
- ST: 0.46M → 1.4-2.3M ops/s (+204-400%)
- MT: 1.86M → 3.7-5.6M ops/s (+99-201%)
Actual results (Larson 2s 8-128B 1024):
- ST: 0.377M → 0.424M ops/s (+12% only)
- MT: 1.856M → 1.453M ops/s (-22% regression!)
Analysis:
- Similar pattern to previous Option A test (+42% ST, -20% MT)
- Entry point reordering alone is insufficient
- True bottleneck may be:
1. tiny_fast_alloc() internals (size-to-class, cache access)
2. Refill cost (1,600 cycles for 16 individual calls)
3. Need Batch Refill optimization (Phase 3) as priority
Next steps:
- Investigate refill bottleneck with perf profiling
- Consider implementing Phase 3 (Batch Refill) before Phase 2
- May need combination of multiple optimizations for breakthrough
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
Changes:
- Reorder malloc() to prioritize Fast Path (initialized + tiny size check first)
- Move Fast Path check before all guard checks (recursion, LD_PRELOAD, etc.)
- Optimize free() with same strategy (initialized check first)
- Add branch prediction hints (__builtin_expect)
Implementation:
- malloc(): Fast Path now executes with 3 branches total
- Branch 1+2: g_initialized && size <= TINY_FAST_THRESHOLD
- Branch 3: tiny_fast_alloc() cache hit check
- Slow Path: All guard checks moved after Fast Path miss
- free(): Fast Path with 1-2 branches
- Branch 1: g_initialized check
- Direct to hak_free_at() on normal case
Performance Results (Larson benchmark, size=8-128B):
Single-thread (threads=1):
- Before: 0.46M ops/s (10.7% of system malloc)
- After: 0.65M ops/s (15.4% of system malloc)
- Change: +42% improvement ✓
Multi-thread (threads=4):
- Before: 1.81M ops/s (25.0% of system malloc)
- After: 1.44M ops/s (19.9% of system malloc)
- Change: -20% regression ✗
Analysis:
- ST improvement shows Fast Path optimization works
- MT regression suggests contention or cache issues
- Did not meet target (+200-400%), further optimization needed
Next Steps:
- Investigate MT regression (cache coherency?)
- Consider more aggressive inlining
- Explore Option B (Refill optimization)