# Active Counter Double-Decrement Bug - Visual Flow Diagram
## Bug Flow Trace (Single Block Lifecycle)
```
┌─────────────────────────────────────────────────────────────────────┐
│ Thread A: Initial Allocation (Linear Mode) │
├─────────────────────────────────────────────────────────────────────┤
│ File: tiny_superslab_alloc.inc.h:463-472 │
│ │
│ meta->used++ │
│ ss_active_inc(tls->ss) ← active = 100 ✅ │
│ return block │
│ │
│ State: Block allocated, counter = 100 │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Thread B: Cross-Thread Free │
├─────────────────────────────────────────────────────────────────────┤
│ File: hakmem_tiny_superslab.h:292-416 (ss_remote_push) │
│ │
│ ss_active_dec_one(ss) ← active = 99 ✅ │
│ Push block to remote queue │
│ │
│ State: Block in remote queue, counter = 99 │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Thread A: Remote Drain │
├─────────────────────────────────────────────────────────────────────┤
│ File: hakmem_tiny_superslab.h:421-529 │
│ (_ss_remote_drain_to_freelist_unsafe) │
│ │
│ meta->freelist = chain_head │
│ (NO counter change) ← active = 99 ✅ │
│ Comment: "no change to used/active; already adjusted at free" │
│ │
│ State: Block in meta->freelist, counter = 99 │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Thread A: P0 Batch Refill ⚠️ BUG HERE! ⚠️ │
├─────────────────────────────────────────────────────────────────────┤
│ File: hakmem_tiny_refill_p0.inc.h:99-109 │
│ │
│ from_freelist = trc_pop_from_freelist(meta, want, &chain) │
│ trc_splice_to_sll(..., &g_tls_sll_head[class_idx], ...) │
│ │
│ ❌ MISSING: ss_active_add(tls->ss, from_freelist) │
│ (NO counter change) ← active = 99 ❌ SHOULD BE 100! │
│ │
│ Comment (WRONG): "from_freelist は既に used/active 計上済み" │
│ "freelist items already counted" │
│ │
│ State: Block in TLS SLL, counter = 99 (WRONG!) │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Thread A: Allocation from TLS SLL │
├─────────────────────────────────────────────────────────────────────┤
│ File: tiny_alloc_fast.inc.h:145-210 (tiny_alloc_fast_pop) │
│ │
│ ptr = g_tls_sll_head[class_idx] │
│ g_tls_sll_head[class_idx] = *(void**)ptr │
│ (NO counter change - correct for TLS cache) │
│ ← active = 99 (still wrong) │
│ return ptr │
│ │
│ State: Block allocated, counter = 99 (WRONG! Should be 100) │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Thread A: Same-Thread Free ⚠️ DOUBLE-DECREMENT! ⚠️ │
├─────────────────────────────────────────────────────────────────────┤
│ File: tiny_free_fast.inc.h:91-145 (tiny_free_fast_ss) │
│ │
│ tiny_alloc_fast_push(class_idx, ptr) ← Push to TLS cache │
│ ss_active_dec_one(ss) ← active = 98 ❌ DOUBLE DEC! │
│ │
│ State: Block in TLS cache, counter = 98 (WRONG! Should be 99) │
│ │
│ ⚠️ BUG RESULT: Counter decremented TWICE (steps 2 and 6) │
│ but only incremented ONCE (step 1) │
│ Net effect: -1 per cycle → underflow → OOM │
└─────────────────────────────────────────────────────────────────────┘
```
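The trace can be replayed as plain counter arithmetic. The following is a minimal sketch, not hakmem code: `ModelSlab` and `model_one_cycle` are hypothetical names, and each statement only mirrors the counter effect that the diagram attributes to the corresponding step. Starting from 99 (the value before step 1), the buggy variant ends at 98 and the variant with the refill increment ends at 99.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the SuperSlab active counter (not hakmem's type). */
typedef struct {
    uint32_t active;
} ModelSlab;

/* Replays the six steps of the trace for a single block.
 * refill_increments selects the fixed (true) or buggy (false) P0 refill. */
static uint32_t model_one_cycle(ModelSlab *ss, bool refill_increments) {
    assert(ss != NULL);
    ss->active++;        /* 1. linear allocation: ss_active_inc */
    ss->active--;        /* 2. cross-thread free: ss_active_dec_one */
                         /* 3. remote drain: no counter change */
    if (refill_increments) {
        ss->active++;    /* 4. P0 refill with ss_active_add(.., 1) */
    }
                         /* 5. alloc from TLS SLL: no counter change */
    ss->active--;        /* 6. same-thread free: ss_active_dec_one */
    return ss->active;
}
```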
---
## Counter State Timeline
```
Step  Action                       Active Counter  Expected  Status
────────────────────────────────────────────────────────────────────────
1     Linear allocation            100             100       ✅
2     Cross-thread free            99              99        ✅
3     Remote drain                 99              99        ✅
4     P0 batch refill (BUG!)       99              100       ❌
5     Alloc from TLS SLL           99              100       ❌
6     Same-thread free (DOUBLE!)   98              99        ❌
────────────────────────────────────────────────────────────────────────
Net: 2 decrements, 1 increment = -1 error per cycle
```
---
## Cascade Effect (100 blocks, heavy cross-thread activity)
```
Cycle  Active Counter  State
─────────────────────────────────────────────────────────
0      100             Initial
1      99              After 1 cycle (should be 100)
2      98              After 2 cycles
...    ...             ...
99     1               After 99 cycles
100    0               Counter reaches zero
101    UINT32_MAX      UNDERFLOW! Counter wraps around
─────────────────────────────────────────────────────────
Result after underflow:
• SuperSlab appears "full" (active = UINT32_MAX)
• superslab_refill() can't reuse slabs
• Registry adoption fails
• Must allocate new SuperSlabs → OOM
• Corrupted state → "free(): invalid pointer"
```
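The wraparound in the table follows from unsigned arithmetic: subtracting 1 from a `uint32_t` that holds 0 yields `UINT32_MAX` rather than trapping. A small sketch under the assumption that the active counter is a plain 32-bit unsigned value (the real counter may be atomic); `model_drift` is a hypothetical helper, not part of hakmem.

```c
#include <assert.h>
#include <stdint.h>

/* Applies the net -1 drift of `cycles` buggy cycles to a uint32_t counter. */
static uint32_t model_drift(uint32_t start, uint32_t cycles) {
    uint32_t active = start;
    for (uint32_t i = 0; i < cycles; i++) {
        active -= 1u;   /* unsigned subtraction wraps modulo 2^32 */
    }
    /* Closed form of the loop above; holds because of modular arithmetic. */
    assert(active == (uint32_t)(start - cycles));
    return active;
}
```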
---
## Comparison: Direct Freelist Allocation (CORRECT)
```
┌─────────────────────────────────────────────────────────────────────┐
│ Thread A: Direct Allocation from Freelist ✅ │
├─────────────────────────────────────────────────────────────────────┤
│ File: tiny_superslab_alloc.inc.h:475-508 │
│ │
│ void* block = meta->freelist │
│ meta->freelist = *(void**)block │
│ meta->used++ │
│ ss_active_inc(tls->ss) ← active++ ✅ CORRECT! │
│ HAK_RET_ALLOC(class_idx, block) │
│ │
│ State: Block allocated, counter incremented (correct!) │
└─────────────────────────────────────────────────────────────────────┘
This path CORRECTLY increments the counter because it understands:
1. Freelist blocks were freed (counter decremented)
2. Allocating from freelist → must increment counter
3. Net effect: counter stays balanced ✅
P0 batch refill must follow the same protocol!
```
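The balanced protocol can be exercised in isolation. This sketch assumes an intrusive freelist (the next pointer lives in the first word of each free block, as the `*(void**)block` read in the diagram suggests). `ModelMeta`, `ModelSuperSlab`, and both functions are hypothetical simplifications, and the free path here is a single-threaded stand-in for the real free paths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the slab metadata and SuperSlab. */
typedef struct { void *freelist; uint32_t used; } ModelMeta;
typedef struct { uint32_t active; } ModelSuperSlab;

/* Free: link the block into the freelist and count it as no longer active. */
static void model_free_to_freelist(ModelMeta *meta, ModelSuperSlab *ss, void *block) {
    assert(ss->active > 0 && meta->used > 0);  /* would underflow otherwise */
    *(void **)block = meta->freelist;
    meta->freelist = block;
    meta->used--;
    ss->active--;
}

/* Alloc: unlink the head block and count it as active again. */
static void *model_alloc_from_freelist(ModelMeta *meta, ModelSuperSlab *ss) {
    void *block = meta->freelist;
    if (block == NULL) {
        return NULL;                           /* freelist is empty */
    }
    meta->freelist = *(void **)block;
    meta->used++;
    ss->active++;                              /* pairs with the decrement at free */
    return block;
}
```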
---
## The Fix
```diff
File: core/hakmem_tiny_refill_p0.inc.h
Lines: 99-109
  while (want > 0) {
      TinyRefillChain chain;
      uint32_t from_freelist = trc_pop_from_freelist(meta, want, &chain);
      if (from_freelist > 0) {
          trc_splice_to_sll(class_idx, &chain, &g_tls_sll_head[class_idx], &g_tls_sll_count[class_idx]);
-         // NOTE: from_freelist は既に used/active 計上済みのブロックの再循環。
-         // nonempty_mask クリアは不要クリアすると後続freeで立たない
+         // FIX (2025-11-07): Blocks from freelist were decremented when freed.
+         // Must increment counter when moving back to allocation pool (TLS SLL).
+         ss_active_add(tls->ss, from_freelist);
          extern unsigned long long g_rf_freelist_items[];
          g_rf_freelist_items[class_idx] += from_freelist;
          ...
      }
  }
```
**Why this fixes the bug:**
1. Freelist blocks are "free" (counter was decremented when freed)
2. TLS SLL blocks are "allocated" (will be returned to user without counter change)
3. Moving from freelist to TLS SLL = moving from "free" to "allocated"
4. Therefore: **counter must be incremented**
This matches the protocol used by direct freelist allocation (`core/tiny_superslab_alloc.inc.h:475-508`).
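The reasoning above can be sketched as a batched move with a single bulk increment. This is an illustrative model only: `RefillMeta`, `RefillSuperSlab`, `RefillTlsSll`, and `model_batch_refill` are hypothetical names, and the real `trc_pop_from_freelist` / `trc_splice_to_sll` helpers are replaced by an inline loop. The point it shows is that the counter grows by exactly the number of blocks moved from the freelist ("free") to the TLS list ("allocated").

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the hakmem data structures. */
typedef struct { void *freelist; } RefillMeta;
typedef struct { uint32_t active; } RefillSuperSlab;
typedef struct { void *head; uint32_t count; } RefillTlsSll;

/* Moves up to `want` blocks from the slab freelist to the TLS list and
 * adds the number moved to the active counter in one step. */
static uint32_t model_batch_refill(RefillMeta *meta, RefillSuperSlab *ss,
                                   RefillTlsSll *sll, uint32_t want) {
    uint32_t moved = 0;
    while (moved < want && meta->freelist != NULL) {
        void *block = meta->freelist;
        meta->freelist = *(void **)block;  /* pop from freelist */
        *(void **)block = sll->head;       /* push onto TLS list */
        sll->head = block;
        sll->count++;
        moved++;
    }
    assert(moved <= UINT32_MAX - ss->active);  /* guard against overflow */
    ss->active += moved;  /* counterpart of ss_active_add(tls->ss, from_freelist) */
    return moved;
}
```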
---
## Why Debug Hooks Mask the Bug
```
Normal Mode (Bug Visible):
• Fast paths enabled
• P0 batch refill active
• High cross-thread free frequency
• Rapid counter underflow → crash in seconds
Debug Mode (Bug Hidden):
• Slower code paths
• Different timing/scheduling
• Reduced cross-thread free frequency
• P0 batch refill less frequent or disabled
• Bug accumulates slowly → may not manifest in test duration
```
---
## Related Files
### Counter Management
- `core/hakmem_tiny.c:177-182` - `ss_active_add()`, `ss_active_inc()`
- `core/hakmem_tiny_superslab.h:189-199` - `ss_active_dec_one()`
### Bug Location
- **`core/hakmem_tiny_refill_p0.inc.h:99-109`** ⚠️ BUG HERE
### Correct Examples
- `core/tiny_superslab_alloc.inc.h:475-508` - Direct freelist alloc (✅ correct)
- `core/tiny_superslab_alloc.inc.h:463-472` - Linear alloc (✅ correct)
### Free Paths (All Correct)
- `core/tiny_free_fast.inc.h:91-145` - Same-thread free (✅)
- `core/hakmem_tiny_superslab.h:292-416` - Cross-thread free (✅)
- `core/hakmem_tiny_superslab.h:421-529` - Remote drain (✅ no change, correct)
---
**Summary:** The bug is a classic double-decrement: P0 batch refill moves blocks from the freelist (free state) to the TLS SLL (allocated state) without incrementing the active counter, so the next same-thread free decrements it a second time.