Phase 19-1: Quick Prune (Frontend SLIM mode) - Experimental implementation

## Implementation
Added a `HAKMEM_TINY_FRONT_SLIM=1` ENV gate that skips the FastCache + SFC layers
and goes straight to the SLL (singly-linked list) free list for direct backend access.

### Code Changes
**File**: `core/tiny_alloc_fast.inc.h` (lines 201-230)

Added early return gate in `tiny_alloc_fast_pop()`:
```c
// Phase 19-1: Quick Prune (Frontend SLIM mode)
static __thread int g_front_slim_checked = 0;
static __thread int g_front_slim_enabled = 0;

// One-time per-thread ENV check (amortized over all allocations)
if (__builtin_expect(!g_front_slim_checked, 0)) {
    const char* e = getenv("HAKMEM_TINY_FRONT_SLIM");
    g_front_slim_enabled = (e && *e && *e != '0') ? 1 : 0;
    g_front_slim_checked = 1;
}

if (g_front_slim_enabled) {
    // Skip FastCache + SFC, go straight to SLL
    extern int g_tls_sll_enable;
    if (g_tls_sll_enable) {
        void* base = NULL;
        if (tls_sll_pop(class_idx, &base)) {
            extern unsigned long long g_front_sll_hit[];
            g_front_sll_hit[class_idx]++;
            return base;  // SLL hit (SLIM fast path)
        }
    }
    return NULL;  // SLL miss → caller refills
}
// else: existing FC → SFC → SLL cascade (unchanged)
```

### Design Rationale
- **Goal**: Skip unused frontend layers to reduce branch misprediction overhead
- **Strategy**: Based on ChatGPT-sensei analysis showing FC/SFC hit rates near 0%
- **Expected**: 22M → 27-30M ops/s (+22-36%)

**Features**:
- A/B testable via ENV (instant rollback: ENV=0)
- Existing code unchanged (backward compatible)
- TLS-cached enable check (amortized overhead)

---

## Performance Results

### Benchmark: Random Mixed 256B (1M iterations)

```
Baseline (SLIM OFF): 23.2M, 23.7M, 23.2M ops/s (avg: 23.4M)
Phase 19-1 (SLIM ON): 22.8M, 22.8M, 23.7M ops/s (avg: 23.1M)

Difference: -1.3% (within noise, no improvement) ⚠️
Expected:   +22-36% ← NOT achieved
```
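
For context, a sketch of what a "random mixed" alloc/free loop over sizes up to 256B looks like, assuming the benchmark drives the allocator through the standard `malloc`/`free` entry points; this is an assumption, the real `bench_random_mixed_hakmem` source is not shown here.

```c
// Hypothetical stand-in for the random-mixed workload (illustration only).
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    long     iters  = (argc > 1) ? atol(argv[1]) : 1000000;        // e.g. 1M iterations
    size_t   max_sz = (argc > 2) ? (size_t)atol(argv[2]) : 256;    // e.g. 256B
    unsigned seed   = (argc > 3) ? (unsigned)atoi(argv[3]) : 42;
    srand(seed);

    enum { SLOTS = 1024 };
    static void* live[SLOTS];   // zero-initialized slot table of live allocations

    for (long i = 0; i < iters; i++) {
        int s = rand() % SLOTS;
        if (live[s]) { free(live[s]); live[s] = NULL; }                        // randomly free...
        else         { live[s] = malloc(1 + (size_t)(rand() % (int)max_sz)); } // ...or allocate
    }
    for (int s = 0; s < SLOTS; s++) free(live[s]);
    printf("done: %ld iterations\n", iters);
    return 0;
}
```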

### Stability Testing
- 100K short run: No SEGV, no crashes
- 1M iterations: Stable performance across 3 runs
- Functional correctness: All allocations successful

---

## Analysis: Why Quick Prune Failed

### Hypothesis 1: FC/SFC Overhead Already Minimal
- FC/SFC checks are branch-predicted (miss path well-optimized)
- Skipping these layers provides negligible cycle savings
- The "0% hit rate" premise may not reflect the actual cost/benefit of keeping these layers
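
A minimal sketch of what this hypothesis claims; `fc_pop`/`sfc_pop` are hypothetical stand-ins for the real FastCache/SFC helpers, not the actual code. If FC/SFC miss on nearly every call, each check is one load plus one well-predicted branch, so there is little left for SLIM mode to remove.

```c
#include <stddef.h>

// Hypothetical stand-ins for the real FastCache / SFC / SLL helpers (illustration only).
void* fc_pop(int class_idx);
void* sfc_pop(int class_idx);
int   tls_sll_pop(int class_idx, void** out);

// Simplified shape of the existing FC → SFC → SLL cascade.
static inline void* cascade_pop(int class_idx) {
    void* p;
    if ((p = fc_pop(class_idx))  != NULL) return p;   // predicted not-taken on miss
    if ((p = sfc_pop(class_idx)) != NULL) return p;   // predicted not-taken on miss
    if (tls_sll_pop(class_idx, &p))       return p;   // the real fast path (88-99% hit)
    return NULL;                                      // full miss → caller refills
}
```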

### Hypothesis 2: ENV Check Overhead Cancels Gains
- TLS variable initialization (`g_front_slim_checked`)
- `getenv()` call overhead on first allocation
- Per-call cost of the SLIM gate check ≈ cost saved by skipping FC/SFC
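
A back-of-the-envelope check of this hypothesis; the clock speed and per-branch costs below are assumed round numbers for illustration, not measurements. At ~23M ops/s each allocation already spends on the order of 150 cycles, so trading two predicted branches for one TLS load plus one branch is invisible against run-to-run noise.

```c
#include <stdio.h>

int main(void) {
    // Assumed figures: 3.5 GHz core, ~23.4M ops/s baseline,
    // ~1 cycle per well-predicted branch, ~1 cycle for a TLS load.
    const double ghz = 3.5, ops_per_sec = 23.4e6;
    const double cycles_per_op = (1e9 / ops_per_sec) * ghz;  // ≈ 150 cycles/op
    const double cycles_saved  = 2.0;  // FC + SFC miss checks skipped
    const double cycles_added  = 2.0;  // SLIM gate: TLS load + branch
    const double net = cycles_saved - cycles_added;
    printf("~%.0f cycles/op, net saving ~%.1f cycles (%.2f%%)\n",
           cycles_per_op, net, 100.0 * net / cycles_per_op);
    return 0;
}
```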

### Hypothesis 3: Incorrect Premise
- Task-sensei's "FC/SFC hit rate 0%" assumption may be wrong
- Layers may provide cache locality benefits even with low hit rate
- Skipping the layers may also disrupt cache-line prefetching patterns

---

## Conclusion & Next Steps

**Phase 19-1 Status**: Experimental - no performance improvement

**Key Learnings**:
1. Frontend layer pruning alone is insufficient
2. Branch prediction in existing code is already effective
3. Structural change (not just pruning) needed for significant gains

**Recommendation**: Proceed to Phase 19-2 (Front-V2 tcache single-layer)
- Phase 19-1 approach (pruning) = failed
- Phase 19-2 approach (structural redesign) = recommended
- Expected: 31ns → 15ns via tcache-style single TLS magazine
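
For reference, a minimal sketch of what a tcache-style single TLS magazine could look like; all names, the class count, and the capacity below are illustrative assumptions, not the Phase 19-2 design.

```c
#include <stddef.h>

// Hypothetical tcache-style single TLS magazine: one array-backed stack per size
// class in TLS, so alloc/free are a bounds check plus one load/store, no cascade.
#define TINY_NUM_CLASSES 8    // assumed number of tiny size classes
#define MAG_CAP          64   // assumed per-class magazine capacity

typedef struct {
    void*    slots[MAG_CAP];
    unsigned count;
} tiny_magazine_t;

static __thread tiny_magazine_t g_tls_mag[TINY_NUM_CLASSES];

// Alloc fast path: one branch, one load.
static inline void* mag_alloc(int class_idx) {
    tiny_magazine_t* m = &g_tls_mag[class_idx];
    if (m->count != 0)
        return m->slots[--m->count];
    return NULL;              // empty → caller refills from the backend in bulk
}

// Free fast path: one branch, one store.
static inline int mag_free(int class_idx, void* p) {
    tiny_magazine_t* m = &g_tls_mag[class_idx];
    if (m->count != MAG_CAP) {
        m->slots[m->count++] = p;
        return 1;
    }
    return 0;                 // full → caller flushes a batch to the backend
}
```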

---

## ENV Usage

```bash
# Enable SLIM mode (experimental, no gain observed)
export HAKMEM_TINY_FRONT_SLIM=1
./bench_random_mixed_hakmem 1000000 256 42

# Disable SLIM mode (default, recommended)
unset HAKMEM_TINY_FRONT_SLIM
./bench_random_mixed_hakmem 1000000 256 42
```

---

## Files Modified
- `core/tiny_alloc_fast.inc.h` - Added Phase 19-1 Quick Prune gate

## Investigation Report
The Task-sensei analysis documented the entry point (`tiny_alloc_fast_pop()`, line 176),
identified the skip targets (FC: lines 208-220, SFC: lines 222-250), and confirmed the
SLL as the primary fast path (88-99% hit rate from prior analysis).

---

📝 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Task-sensei (tiny_alloc_fast.inc.h structure analysis)
Co-Authored-By: ChatGPT (Phase 19 strategy design)
---

**Diff**: `core/tiny_alloc_fast.inc.h`

```diff
@@ -198,6 +198,37 @@ static inline void* tiny_alloc_fast_pop(int class_idx) {
         }
         return NULL;
 #else
+    // ========== Phase 19-1: Quick Prune (Frontend SLIM mode) ==========
+    // ENV: HAKMEM_TINY_FRONT_SLIM=1
+    // Goal: Skip FastCache + SFC layers, go straight to SLL (88-99% hit rate)
+    // Expected: 22M → 27-30M ops/s (+22-36%)
+    static __thread int g_front_slim_checked = 0;
+    static __thread int g_front_slim_enabled = 0;
+    if (__builtin_expect(!g_front_slim_checked, 0)) {
+        const char* e = getenv("HAKMEM_TINY_FRONT_SLIM");
+        g_front_slim_enabled = (e && *e && *e != '0') ? 1 : 0;
+        g_front_slim_checked = 1;
+    }
+    // SLIM MODE: Skip FastCache + SFC, go straight to SLL
+    if (__builtin_expect(g_front_slim_enabled, 0)) {
+        // Box Boundary: TLS SLL freelist pop (only layer in SLIM mode)
+        extern int g_tls_sll_enable;
+        if (__builtin_expect(g_tls_sll_enable, 1)) {
+            void* base = NULL;
+            if (tls_sll_pop(class_idx, &base)) {
+                // Front Gate: SLL hit (SLIM fast path - 3 instructions)
+                extern unsigned long long g_front_sll_hit[];
+                g_front_sll_hit[class_idx]++;
+                return base;
+            }
+        }
+        // SLIM mode miss → return NULL (caller refills)
+        return NULL;
+    }
+    // ========== End Phase 19-1: Quick Prune ==========
     // Phase 7 Task 3: Profiling overhead removed in release builds
     // In release mode, compiler can completely eliminate profiling code
 #if !HAKMEM_BUILD_RELEASE
```