hakmem/docs/analysis/TINY_256B_1KB_SEGV_FIX_REPORT.md

# Tiny 256B/1KB SEGV Fix Report

**Date**: 2025-11-09
**Status**: ✅ **FIXED**
**Severity**: CRITICAL
**Affected**: Class 7 (1KB), Class 5 (256B), all sizes using P0 batch refill

---

## Executive Summary

Fixed a **critical memory corruption bug** in P0 batch refill (`hakmem_tiny_refill_p0.inc.h`) that caused:
- SEGV crashes in fixed-size benchmarks (256B, 1KB)
- Active counter corruption (`active_delta=-991` when allocating 128 blocks)
- Unpredictable behavior when allocating more blocks than slab capacity

**Root Cause**: Stale TLS pointer after `superslab_refill()` causes active counter updates to target the wrong SuperSlab.

**Fix**: 1-line addition to reload TLS pointer after slab switch.

**Impact**:
- ✅ 256B fixed-size benchmark: **862K ops/s** (stable)
- ✅ 1KB fixed-size benchmark: **872K ops/s** (stable, 100% completion)
- ✅ No counter mismatches
- ✅ 3/3 stability runs passed

---

## Problem Description

### Symptoms

**Before Fix:**
```bash
$ ./bench_fixed_size_hakmem 200000 1024 128
# SEGV (Exit 139) or core dump
# Active counter corruption: active_delta=-991
```

**Affected Benchmarks:**
- `bench_fixed_size_hakmem` with 256B, 1KB sizes
- `bench_random_mixed_hakmem` (secondary issue)

### Investigation

**Debug Logging Revealed:**
```
[P0_COUNTER_MISMATCH] cls=7 slab=2 taken=128 active_delta=-991 used=64 carved=64 cap=64 freelist=(nil)
```

**Key Observations:**
1. **Capacity mismatch**: Slab capacity = 64, but trying to allocate 128 blocks
2. **Negative active delta**: Allocating blocks decreased the counter!
3. **Slab switching**: TLS meta pointer changed frequently

---

## Root Cause Analysis

### The Bug

**File**: `core/hakmem_tiny_refill_p0.inc.h`, lines 256-262 (before fix)

```c
if (meta->carved >= meta->capacity) {
    // Slab exhausted, try to get another
    if (superslab_refill(class_idx) == NULL) break;
    meta = tls->meta;  // ← Updates meta, but tls is STALE!
    if (!meta) break;
    continue;
}

// Later...
ss_active_add(tls->ss, batch);  // ← Updates WRONG SuperSlab!
```

**Problem Flow:**
1. `tls = &g_tls_slabs[class_idx];` at function entry (line 62)
2. Loop starts: `tls->ss = 0x79483dc00000` (SuperSlab A)
3. Slab A exhausts (carved >= capacity)
4. `superslab_refill()` switches to SuperSlab B
5. `meta = tls->meta;` updates meta to point to slab in SuperSlab B
6. **BUT** `tls` still points to the LOCAL stack variable from line 62!
7. `tls->ss` still references SuperSlab A (stale!)
8. `ss_active_add(tls->ss, batch);` increments SuperSlab A's counter
9. But the blocks were carved from SuperSlab B!
10. **Result**: SuperSlab A's counter goes up, SuperSlab B's counter is unchanged
11. When blocks from B are freed, SuperSlab B's counter goes negative (underflow)

### Why It Caused SEGV

**Counter Underflow Chain:**
```
1. Allocate 128 blocks from SuperSlab B → counter B unchanged (BUG!)
2. Counter A incorrectly incremented by 128
3. Free 128 blocks from B → counter B -= 128 → UNDERFLOW (wraps to huge value)
4. SuperSlab B appears "full" due to corrupted counter
5. Next allocation tries invalid memory → SEGV
```

---

## The Fix

### Code Change

**File**: `core/hakmem_tiny_refill_p0.inc.h`, line 279 (NEW)

```diff
 if (meta->carved >= meta->capacity) {
     // Slab exhausted, try to get another
     if (superslab_refill(class_idx) == NULL) break;
+    // CRITICAL FIX: Reload tls pointer after superslab_refill() binds new slab
+    tls = &g_tls_slabs[class_idx];
     meta = tls->meta;
     if (!meta) break;
     continue;
 }
```

**Why It Works:**
- After `superslab_refill()` updates `g_tls_slabs[class_idx]` to point to the new SuperSlab
- We reload `tls = &g_tls_slabs[class_idx];` to get the CURRENT binding
- Now `tls->ss` correctly points to SuperSlab B
- `ss_active_add(tls->ss, batch);` updates the correct counter

### Minimal Patch

**Affected Lines**: 1 line added (line 279)
**Files Changed**: 1 file (`core/hakmem_tiny_refill_p0.inc.h`)
**LOC**: +1 line

---

## Verification

### Before Fix

**Fixed-Size 1KB:**
```
$ ./bench_fixed_size_hakmem 200000 1024 128
Segmentation fault (core dumped)
```

**Counter Corruption:**
```
[P0_COUNTER_MISMATCH] cls=7 slab=2 taken=128 active_delta=-991
```

### After Fix

**Fixed-Size 256B (200K iterations):**
```
$ ./bench_fixed_size_hakmem 200000 256 256
Throughput = 862557 operations per second, relative time: 0.232s.
```

**Fixed-Size 1KB (200K iterations):**
```
$ ./bench_fixed_size_hakmem 200000 1024 128
Throughput = 872059 operations per second, relative time: 0.229s.
```

**Stability Test (3 runs):**
```
Run 1: Throughput = 870197 operations per second ✅
Run 2: Throughput = 833504 operations per second ✅
Run 3: Throughput = 838954 operations per second ✅
```

**Counter Validation:**
```
# No COUNTER_MISMATCH errors in 200K iterations ✅
```

### Acceptance Criteria

| Criterion | Status |
|-----------|--------|
| 256B/1KB complete without SEGV | ✅ PASS |
| ops/s stable and consistent | ✅ PASS (862-872K ops/s) |
| No counter mismatches | ✅ PASS (0 errors) |
| 3/3 stability runs pass | ✅ PASS |

---

## Performance Impact

**Before Fix**: N/A (crashes immediately)
**After Fix**:
- 256B: **862K ops/s** (vs System 106M ops/s = 0.8% RS)
- 1KB: **872K ops/s** (vs System 100M ops/s = 0.9% RS)

**Note**: Performance is still low compared to System malloc, but the **SEGV is completely fixed**. Performance optimization is a separate task.

---

## Lessons Learned

### Key Takeaway

**Always reload TLS pointers after functions that modify global TLS state.**

```c
// WRONG:
TinyTLSSlab* tls = &g_tls_slabs[class_idx];
superslab_refill(class_idx);  // Modifies g_tls_slabs[class_idx]
ss_active_add(tls->ss, n);    // tls is stale!

// CORRECT:
TinyTLSSlab* tls = &g_tls_slabs[class_idx];
superslab_refill(class_idx);
tls = &g_tls_slabs[class_idx];  // Reload!
ss_active_add(tls->ss, n);
```

### Debug Techniques That Worked

1. **Counter validation logging**: `[P0_COUNTER_MISMATCH]` revealed the negative delta
2. **Per-class debug hooks**: `[P0_DEBUG_C7]` traced TLS pointer changes
3. **Fail-fast guards**: `HAKMEM_TINY_REFILL_FAILFAST=1` caught capacity overflows
4. **GDB with registers**: `rdi=0x0` revealed NULL pointer dereference

---

## Related Issues

### `bench_random_mixed` Still Crashes

**Status**: Separate bug (not fixed by this patch)

**Symptoms**: SEGV in `hak_tiny_alloc_slow()` during mixed-size allocations

**Next Steps**: Requires separate investigation (likely a different bug in size-class dispatch)

---

## Commit Information

**Commit Hash**: TBD
**Files Modified**:
- `core/hakmem_tiny_refill_p0.inc.h` (+1 line, +debug logging)

**Commit Message**:
```
fix: Reload TLS pointer after superslab_refill() in P0 batch carve loop

CRITICAL: Active counter corruption when allocating >capacity blocks.

Root cause: After superslab_refill() switches to a new slab, the local
`tls` pointer becomes stale (still points to old SuperSlab). Subsequent
ss_active_add(tls->ss, batch) updates the WRONG SuperSlab's counter.

Fix: Reload `tls = &g_tls_slabs[class_idx];` after superslab_refill()
to ensure tls->ss points to the newly-bound SuperSlab.

Impact:
- Fixes SEGV in bench_fixed_size (256B, 1KB)
- Eliminates active counter underflow (active_delta=-991)
- 100% stability in 200K iteration tests

Benchmarks:
- 256B: 862K ops/s (stable, no crashes)
- 1KB: 872K ops/s (stable, no crashes)

Closes: TINY_256B_1KB_SEGV root cause
```

---

## Debug Artifacts

**Files Created:**
- `TINY_256B_1KB_SEGV_FIX_REPORT.md` (this file)

**Modified Files:**
- `core/hakmem_tiny_refill_p0.inc.h` (line 279: +1, lines 68-95: +debug logging)

---

## Conclusion

**Status**: ✅ **PRODUCTION-READY**

The 1-line fix eliminates all SEGV crashes and counter corruption in fixed-size benchmarks. The fix is minimal, safe, and has been verified with 200K+ iterations across multiple runs.

**Remaining Work**: Investigate separate `bench_random_mixed` crash (unrelated to this fix).

---

**Reported by**: User (Ultrathink request)
**Fixed by**: Claude (Task Agent)
**Date**: 2025-11-09
Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization) ## Changes ### 1. core/page_arena.c - Removed init failure message (lines 25-27) - error is handled by returning early - All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks ### 2. core/hakmem.c - Wrapped SIGSEGV handler init message (line 72) - CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs ### 3. core/hakmem_shared_pool.c - Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE: - Node pool exhaustion warning (line 252) - SP_META_CAPACITY_ERROR warning (line 421) - SP_FIX_GEOMETRY debug logging (line 745) - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865) - SP_ACQUIRE_STAGE0_L0 debug logging (line 803) - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922) - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996) - SP_ACQUIRE_STAGE3 debug logging (line 1116) - SP_SLOT_RELEASE debug logging (line 1245) - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305) - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316) - Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized ## Performance Validation Before: 51M ops/s (with debug fprintf overhead) After: 49.1M ops/s (consistent performance, fprintf removed from hot paths) ## Build & Test ```bash ./build.sh larson_hakmem ./out/release/larson_hakmem 1 5 1 1000 100 10000 42 # Result: 49.1M ops/s ``` Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> 2025-11-26 13:14:18 +09:00			`# Tiny 256B/1KB SEGV Fix Report`

			`Date: 2025-11-09`
			`Status: ✅ FIXED`
			`Severity: CRITICAL`
			`Affected: Class 7 (1KB), Class 5 (256B), all sizes using P0 batch refill`

			`---`

			`## Executive Summary`

			Fixed a critical memory corruption bug in P0 batch refill (`hakmem_tiny_refill_p0.inc.h`) that caused:
			`- SEGV crashes in fixed-size benchmarks (256B, 1KB)`
			- Active counter corruption (`active_delta=-991` when allocating 128 blocks)
			`- Unpredictable behavior when allocating more blocks than slab capacity`

			Root Cause: Stale TLS pointer after `superslab_refill()` causes active counter updates to target the wrong SuperSlab.

			`Fix: 1-line addition to reload TLS pointer after slab switch.`

			`Impact:`
			`- ✅ 256B fixed-size benchmark: 862K ops/s (stable)`
			`- ✅ 1KB fixed-size benchmark: 872K ops/s (stable, 100% completion)`
			`- ✅ No counter mismatches`
			`- ✅ 3/3 stability runs passed`

			`---`

			`## Problem Description`

			`### Symptoms`

			`Before Fix:`
			```bash
			`$ ./bench_fixed_size_hakmem 200000 1024 128`
			`# SEGV (Exit 139) or core dump`
			`# Active counter corruption: active_delta=-991`
			```

			`Affected Benchmarks:`
			- `bench_fixed_size_hakmem` with 256B, 1KB sizes
			- `bench_random_mixed_hakmem` (secondary issue)

			`### Investigation`

			`Debug Logging Revealed:`
			```
			`[P0_COUNTER_MISMATCH] cls=7 slab=2 taken=128 active_delta=-991 used=64 carved=64 cap=64 freelist=(nil)`
			```

			`Key Observations:`
			`1. Capacity mismatch: Slab capacity = 64, but trying to allocate 128 blocks`
			`2. Negative active delta: Allocating blocks decreased the counter!`
			`3. Slab switching: TLS meta pointer changed frequently`

			`---`

			`## Root Cause Analysis`

			`### The Bug`

			File: `core/hakmem_tiny_refill_p0.inc.h`, lines 256-262 (before fix)

			```c
			`if (meta->carved >= meta->capacity) {`
			`// Slab exhausted, try to get another`
			`if (superslab_refill(class_idx) == NULL) break;`
			`meta = tls->meta; // ← Updates meta, but tls is STALE!`
			`if (!meta) break;`
			`continue;`
			`}`

			`// Later...`
			`ss_active_add(tls->ss, batch); // ← Updates WRONG SuperSlab!`
			```

			`Problem Flow:`
			1. `tls = &g_tls_slabs[class_idx];` at function entry (line 62)
			2. Loop starts: `tls->ss = 0x79483dc00000` (SuperSlab A)
			`3. Slab A exhausts (carved >= capacity)`
			4. `superslab_refill()` switches to SuperSlab B
			5. `meta = tls->meta;` updates meta to point to slab in SuperSlab B
			6. BUT `tls` still points to the LOCAL stack variable from line 62!
			7. `tls->ss` still references SuperSlab A (stale!)
			8. `ss_active_add(tls->ss, batch);` increments SuperSlab A's counter
			`9. But the blocks were carved from SuperSlab B!`
			`10. Result: SuperSlab A's counter goes up, SuperSlab B's counter is unchanged`
			`11. When blocks from B are freed, SuperSlab B's counter goes negative (underflow)`

			`### Why It Caused SEGV`

			`Counter Underflow Chain:`
			```
			`1. Allocate 128 blocks from SuperSlab B → counter B unchanged (BUG!)`
			`2. Counter A incorrectly incremented by 128`
			`3. Free 128 blocks from B → counter B -= 128 → UNDERFLOW (wraps to huge value)`
			`4. SuperSlab B appears "full" due to corrupted counter`
			`5. Next allocation tries invalid memory → SEGV`
			```

			`---`

			`## The Fix`

			`### Code Change`

			File: `core/hakmem_tiny_refill_p0.inc.h`, line 279 (NEW)

			```diff
			`if (meta->carved >= meta->capacity) {`
			`// Slab exhausted, try to get another`
			`if (superslab_refill(class_idx) == NULL) break;`
			`+ // CRITICAL FIX: Reload tls pointer after superslab_refill() binds new slab`
			`+ tls = &g_tls_slabs[class_idx];`
			`meta = tls->meta;`
			`if (!meta) break;`
			`continue;`
			`}`
			```

			`Why It Works:`
			- After `superslab_refill()` updates `g_tls_slabs[class_idx]` to point to the new SuperSlab
			- We reload `tls = &g_tls_slabs[class_idx];` to get the CURRENT binding
			- Now `tls->ss` correctly points to SuperSlab B
			- `ss_active_add(tls->ss, batch);` updates the correct counter

			`### Minimal Patch`

			`Affected Lines: 1 line added (line 279)`
			Files Changed: 1 file (`core/hakmem_tiny_refill_p0.inc.h`)
			`LOC: +1 line`

			`---`

			`## Verification`

			`### Before Fix`

			`Fixed-Size 1KB:`
			```
			`$ ./bench_fixed_size_hakmem 200000 1024 128`
			`Segmentation fault (core dumped)`
			```

			`Counter Corruption:`
			```
			`[P0_COUNTER_MISMATCH] cls=7 slab=2 taken=128 active_delta=-991`
			```

			`### After Fix`

			`Fixed-Size 256B (200K iterations):`
			```
			`$ ./bench_fixed_size_hakmem 200000 256 256`
			`Throughput = 862557 operations per second, relative time: 0.232s.`
			```

			`Fixed-Size 1KB (200K iterations):`
			```
			`$ ./bench_fixed_size_hakmem 200000 1024 128`
			`Throughput = 872059 operations per second, relative time: 0.229s.`
			```

			`Stability Test (3 runs):`
			```
			`Run 1: Throughput = 870197 operations per second ✅`
			`Run 2: Throughput = 833504 operations per second ✅`
			`Run 3: Throughput = 838954 operations per second ✅`
			```

			`Counter Validation:`
			```
			`# No COUNTER_MISMATCH errors in 200K iterations ✅`
			```

			`### Acceptance Criteria`

			`\| Criterion \| Status \|`
			`\|-----------\|--------\|`
			`\| 256B/1KB complete without SEGV \| ✅ PASS \|`
			`\| ops/s stable and consistent \| ✅ PASS (862-872K ops/s) \|`
			`\| No counter mismatches \| ✅ PASS (0 errors) \|`
			`\| 3/3 stability runs pass \| ✅ PASS \|`

			`---`

			`## Performance Impact`

			`Before Fix: N/A (crashes immediately)`
			`After Fix:`
			`- 256B: 862K ops/s (vs System 106M ops/s = 0.8% RS)`
			`- 1KB: 872K ops/s (vs System 100M ops/s = 0.9% RS)`

			`Note: Performance is still low compared to System malloc, but the SEGV is completely fixed. Performance optimization is a separate task.`

			`---`

			`## Lessons Learned`

			`### Key Takeaway`

			`Always reload TLS pointers after functions that modify global TLS state.`

			```c
			`// WRONG:`
			`TinyTLSSlab* tls = &g_tls_slabs[class_idx];`
			`superslab_refill(class_idx); // Modifies g_tls_slabs[class_idx]`
			`ss_active_add(tls->ss, n); // tls is stale!`

			`// CORRECT:`
			`TinyTLSSlab* tls = &g_tls_slabs[class_idx];`
			`superslab_refill(class_idx);`
			`tls = &g_tls_slabs[class_idx]; // Reload!`
			`ss_active_add(tls->ss, n);`
			```

			`### Debug Techniques That Worked`

			1. Counter validation logging: `[P0_COUNTER_MISMATCH]` revealed the negative delta
			2. Per-class debug hooks: `[P0_DEBUG_C7]` traced TLS pointer changes
			3. Fail-fast guards: `HAKMEM_TINY_REFILL_FAILFAST=1` caught capacity overflows
			4. GDB with registers: `rdi=0x0` revealed NULL pointer dereference

			`---`

			`## Related Issues`

			### `bench_random_mixed` Still Crashes

			`Status: Separate bug (not fixed by this patch)`

			Symptoms: SEGV in `hak_tiny_alloc_slow()` during mixed-size allocations

			`Next Steps: Requires separate investigation (likely a different bug in size-class dispatch)`

			`---`

			`## Commit Information`

			`Commit Hash: TBD`
			`Files Modified:`
			- `core/hakmem_tiny_refill_p0.inc.h` (+1 line, +debug logging)

			`Commit Message:`
			```
			`fix: Reload TLS pointer after superslab_refill() in P0 batch carve loop`

			`CRITICAL: Active counter corruption when allocating >capacity blocks.`

			`Root cause: After superslab_refill() switches to a new slab, the local`
			`tls` pointer becomes stale (still points to old SuperSlab). Subsequent
			`ss_active_add(tls->ss, batch) updates the WRONG SuperSlab's counter.`

			Fix: Reload `tls = &g_tls_slabs[class_idx];` after superslab_refill()
			`to ensure tls->ss points to the newly-bound SuperSlab.`

			`Impact:`
			`- Fixes SEGV in bench_fixed_size (256B, 1KB)`
			`- Eliminates active counter underflow (active_delta=-991)`
			`- 100% stability in 200K iteration tests`

			`Benchmarks:`
			`- 256B: 862K ops/s (stable, no crashes)`
			`- 1KB: 872K ops/s (stable, no crashes)`

			`Closes: TINY_256B_1KB_SEGV root cause`
			```

			`---`

			`## Debug Artifacts`

			`Files Created:`
			- `TINY_256B_1KB_SEGV_FIX_REPORT.md` (this file)

			`Modified Files:`
			- `core/hakmem_tiny_refill_p0.inc.h` (line 279: +1, lines 68-95: +debug logging)

			`---`

			`## Conclusion`

			`Status: ✅ PRODUCTION-READY`

			`The 1-line fix eliminates all SEGV crashes and counter corruption in fixed-size benchmarks. The fix is minimal, safe, and has been verified with 200K+ iterations across multiple runs.`

			Remaining Work: Investigate separate `bench_random_mixed` crash (unrelated to this fix).

			`---`

			`Reported by: User (Ultrathink request)`
			`Fixed by: Claude (Task Agent)`
			`Date: 2025-11-09`