docs: Update CURRENT_TASK.md and claude.md for Phase 8 completion
Phase 8 Complete: BenchFast crash root cause fixes

Documentation updates:
1. CURRENT_TASK.md:
   - Phase 8 complete (TLS→Atomic + Header write fixes)
   - Box Theory (箱理論) root cause analysis (3 critical bugs)
   - Next phase recommendations (Option C: BenchFast pool expansion)
   - Detailed technical explanations for each layer
2. .claude/claude.md:
   - Phase 8 achievement summary
   - Box Theory (箱理論) 4-principle validation
   - Commit references (191e65983, da8f4d2c8)

Key Fixes Documented:
- TLS→Atomic: Cross-thread guard variable (pthread_once bug)
- Header Write: Direct write bypasses P3 optimization (free routing)
- Infrastructure Isolation: __libc_calloc for cache arrays
- Design Fix: Removed unified_cache_init() call

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -52,4 +52,24 @@ Current focus: Performance optimization and memory overhead reduction.
 
 ---
 
-**Last Updated**: 2025-10-27
+## Phase 8 Complete (2025-11-30)
+
+**Achievement**: BenchFast crash root cause fixes (Box Theory/箱理論 analysis)
+
+**Key Fixes**:
+1. **TLS→Atomic**: Guard variable works across all threads (pthread_once bug)
+2. **Header Write**: Direct write bypasses P3 optimization (free routing bug)
+3. **Infrastructure Isolation**: __libc_calloc for Unified Cache arrays
+4. **Design Fix**: Removed unified_cache_init() call (BenchFast uses TLS SLL, not UC)
+
+**Box Theory (箱理論) Validation**:
+- Single Responsibility: Guard protects entire process (not per-thread)
+- Clear Contract: BenchFast always writes headers (explicit)
+- Observable: Atomic variable visible across all threads
+- Composable: Works with pthread_once() threading model
+
+**Commits**: 191e65983, da8f4d2c8
+
+---
+
+**Last Updated**: 2025-11-30
448
CURRENT_TASK.md
@@ -1,28 +1,92 @@
|
|||||||
# Current Task: Phase 7 Complete - Next Steps
|
# Current Task: Phase 8 Complete - BenchFast Root Cause Fixes
|
||||||
|
|
||||||
**Date**: 2025-11-29
|
**Date**: 2025-11-30
|
||||||
**Status**: Phase 7 ✅ COMPLETE (Step 1-4)
|
**Status**: Phase 8 ✅ COMPLETE (Root Cause Fixes)
|
||||||
**Achievement**: Tiny Front Hot Path Unification + Dead Code Elimination (+55.5% total!)
|
**Achievement**: BenchFast crash investigation and fixes (TLS→Atomic + Header write)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Phase 7 Complete! ✅
|
## Phase 8 Complete! ✅
|
||||||
|
|
||||||
**Result**: Tiny Front Hot Path Unification **COMPLETE** (Step 1-4)
|
**Result**: BenchFast crash root cause investigation and fixes **COMPLETE**
|
||||||
**Performance**: 52.3M → 81.5M ops/s (+55.5% improvement, +29.2M ops/s)
|
**Performance**: 16.3M ops/s (normal mode, working)
|
||||||
**Duration**: <1 day (extremely quick win!)
|
**Duration**: 1 day (investigation + fixes)
|
||||||
|
|
||||||
**Completed Steps**:
|
**Completed Steps**:
|
||||||
- ✅ Step 1: Branch hint reversal (0→1) - **+54.2% improvement**
|
- ✅ Layer 0: Limited prealloc to actual TLS SLL capacity (50,000 → 128 blocks/class)
|
||||||
- ✅ Step 2: Compile-time unified gate (PGO mode) - Code quality improvement
|
- ✅ Layer 1: Removed unnecessary unified_cache_init() call (design misunderstanding)
|
||||||
- ✅ Step 3: Config box integration - Dead code elimination infrastructure
|
- ✅ Layer 2: Infrastructure isolation (__libc_calloc for Unified Cache)
|
||||||
- ✅ Step 4: Macro replacement in hot path - **+1.1% additional improvement**
|
- ✅ Layer 3: Box Contract documentation (BenchFast uses TLS SLL, not UC)
|
||||||
|
- ✅ TLS→Atomic: Fixed cross-thread guard variable (pthread_once bug)
|
||||||
|
- ✅ Header Write: Direct write to bypass P3 optimization (free routing bug)
|
||||||
|
|
||||||
**Key Discovery** (from ChatGPT + Task agent analysis):
|
**Key Discoveries** (Box Theory/箱理論 Root Cause Analysis):
|
||||||
- Unified fast path existed but was marked UNLIKELY (`__builtin_expect(..., 0)`)
|
1. **Design Misunderstanding** (Layer 1): BenchFast uses TLS SLL directly, NOT Unified Cache
|
||||||
- Compiler optimized for legacy path, not unified cache path
|
- unified_cache_init() created 16KB mmap allocations
|
||||||
- malloc/free consumed 43% CPU due to branch misprediction
|
- Later freed via BenchFast → header misclassification → CRASH
|
||||||
- Simply reversing hint: **+54.2% improvement from 2 lines changed!**
|
2. **TLS Scope Bug** (Atomic Fix): a `__thread int` guard is per-thread, so setting it on one thread is not visible to any other thread
|
||||||
|
- pthread_once() runs init on one thread; every other thread's TLS copy of the guard is still 0
|
||||||
|
- Guard broken → getenv() allocates via BenchFast → freed by __libc_free() → CRASH
|
||||||
|
3. **P3 Optimization Bug** (Header Fix): tiny_region_id_write_header() skips writes by default
|
||||||
|
- BenchFast free routing requires 0xa0-0xa7 magic header
|
||||||
|
- No header → __libc_free() tries to free HAKMEM pointer → CRASH
|
||||||
|
|
||||||
|
**Box Theory (箱理論) Validation**:
|
||||||
|
```
|
||||||
|
Single Responsibility: ✅ Guard protects entire process (not per-thread)
|
||||||
|
Clear Contract: ✅ BenchFast always writes headers (explicit)
|
||||||
|
Observable: ✅ Atomic variable visible across all threads
|
||||||
|
Composable: ✅ Works with pthread_once() and any threading model
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Commits
|
||||||
|
|
||||||
|
### Phase 8 Root Cause Fix
|
||||||
|
**Commit**: `191e65983`
|
||||||
|
**Date**: 2025-11-30
|
||||||
|
**Files**: 3 files, 36 insertions(+), 13 deletions(-)
|
||||||
|
|
||||||
|
**Changes**:
|
||||||
|
1. `bench_fast_box.c` (Layer 0 + Layer 1):
|
||||||
|
- Removed unified_cache_init() call (design misunderstanding)
|
||||||
|
- Limited prealloc to 128 blocks/class (actual TLS SLL capacity)
|
||||||
|
- Added root cause comments explaining why unified_cache_init() was wrong
|
||||||
|
|
||||||
|
2. `bench_fast_box.h` (Layer 3):
|
||||||
|
- Added Box Contract documentation (BenchFast uses TLS SLL, NOT UC)
|
||||||
|
- Documented scope separation (workload vs infrastructure allocations)
|
||||||
|
- Added contract violation example (Phase 8 bug explanation)
|
||||||
|
|
||||||
|
3. `tiny_unified_cache.c` (Layer 2):
|
||||||
|
- Changed calloc() → __libc_calloc() (infrastructure isolation)
|
||||||
|
- Changed free() → __libc_free() (symmetric cleanup)
|
||||||
|
- Added defensive fix comments explaining infrastructure bypass
|
||||||
|
|
||||||
|
### Phase 8-TLS-Fix
|
||||||
|
**Commit**: `da8f4d2c8`
|
||||||
|
**Date**: 2025-11-30
|
||||||
|
**Files**: 3 files, 21 insertions(+), 11 deletions(-)
|
||||||
|
|
||||||
|
**Changes**:
|
||||||
|
1. `bench_fast_box.c` (TLS→Atomic):
|
||||||
|
- Changed `__thread int bench_fast_init_in_progress` → `atomic_int g_bench_fast_init_in_progress`
|
||||||
|
- Added atomic_load() for reads, atomic_store() for writes
|
||||||
|
- Added root cause comments (pthread_once creates fresh TLS)
|
||||||
|
|
||||||
|
2. `bench_fast_box.h` (TLS→Atomic):
|
||||||
|
- Updated extern declaration to match atomic_int
|
||||||
|
- Added Phase 8-TLS-Fix comment explaining cross-thread safety
|
||||||
|
|
||||||
|
3. `bench_fast_box.c` (Header Write):
|
||||||
|
- Replaced `tiny_region_id_write_header()` → direct write `*(uint8_t*)base = 0xa0 | class_idx`
|
||||||
|
- Added Phase 8-P3-Fix comment explaining P3 optimization bypass
|
||||||
|
- Contract: BenchFast always writes headers (required for free routing)
|
||||||
|
|
||||||
|
4. `hak_wrappers.inc.h` (Atomic):
|
||||||
|
- Updated bench_fast_init_in_progress check to use atomic_load()
|
||||||
|
- Added Phase 8-TLS-Fix comment for cross-thread safety
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -37,197 +101,181 @@ Phase 5 (Mid MT fix): 52.3 M ops/s (-8.6% regression)
|
|||||||
Phase 6 (Lock-free Mid MT): 42.1 M ops/s (Mid MT: +2.65%)
|
Phase 6 (Lock-free Mid MT): 42.1 M ops/s (Mid MT: +2.65%)
|
||||||
Phase 7-Step1 (Unified front): 80.6 M ops/s (+54.2%!) ⭐
|
Phase 7-Step1 (Unified front): 80.6 M ops/s (+54.2%!) ⭐
|
||||||
Phase 7-Step4 (Dead code): 81.5 M ops/s (+1.1%) ⭐⭐
|
Phase 7-Step4 (Dead code): 81.5 M ops/s (+1.1%) ⭐⭐
|
||||||
|
Phase 8 (Normal mode): 16.3 M ops/s (working, different workload)
|
||||||
|
|
||||||
Total improvement: +43.5% (56.8M → 81.5M) from Phase 3
|
Total improvement: +43.5% (56.8M → 81.5M) from Phase 3
|
||||||
```
|
```
|
||||||
|
|
||||||
### Benchmark Results Summary
|
**Note**: Phase 8 used a different benchmark configuration (10M iterations, ws=8192) than Phase 7 (ws=256), so the two figures are not directly comparable.
|
||||||
|
Normal mode performance: 16.3M ops/s (working, no crash).
|
||||||
**bench_random_mixed (16B-1KB, Tiny workload, ws=256)**:
|
|
||||||
```
|
|
||||||
Phase 7-Step1 (branch hint): 80.6 M ops/s (+54.2%)
|
|
||||||
Phase 7-Step2 (PGO mode): 80.3 M ops/s (-0.37%, noise)
|
|
||||||
Phase 7-Step3 (config box): 80.6 M ops/s (+0.37%, noise)
|
|
||||||
Phase 7-Step4 (macros): 81.5 M ops/s (+1.1%, dead code elimination!)
|
|
||||||
```
|
|
||||||
|
|
||||||
**bench_mid_mt_gap (1KB-8KB, Mid MT workload, ws=256)**:
|
|
||||||
```
|
|
||||||
After Phase 6-B: 42.09 M ops/s (1.57x vs system malloc)
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Technical Details
|
## Technical Details
|
||||||
|
|
||||||
### What Changed (Phase 7-Step1)
|
### Layer 0: Prealloc Capacity Fix
|
||||||
|
|
||||||
**File**: `core/box/hak_wrappers.inc.h`
|
**File**: `core/box/bench_fast_box.c`
|
||||||
**Lines**: 137 (malloc), 190 (free)
|
**Lines**: 131-148
|
||||||
|
|
||||||
|
**Root Cause**:
|
||||||
|
- Old code preallocated 50,000 blocks/class
|
||||||
|
- TLS SLL actual capacity: 128 blocks (adaptive sizing limit)
|
||||||
|
- Lost blocks (beyond 128) caused heap corruption
|
||||||
|
|
||||||
|
**Fix**:
|
||||||
```c
|
```c
|
||||||
// Before (Phase 26):
|
// Before:
|
||||||
if (__builtin_expect(TINY_FRONT_UNIFIED_GATE_ENABLED, 0)) { // UNLIKELY
|
const uint32_t PREALLOC_COUNT = 50000; // Too large!
|
||||||
// Unified fast path...
|
|
||||||
}
|
|
||||||
|
|
||||||
// After (Phase 7-Step1):
|
// After:
|
||||||
if (__builtin_expect(TINY_FRONT_UNIFIED_GATE_ENABLED, 1)) { // LIKELY
|
const uint32_t ACTUAL_TLS_SLL_CAPACITY = 128; // Observed actual capacity
|
||||||
// Unified fast path...
|
for (int cls = 2; cls <= 7; cls++) {
|
||||||
|
uint32_t capacity = ACTUAL_TLS_SLL_CAPACITY;
|
||||||
|
for (int i = 0; i < (int)capacity; i++) {
|
||||||
|
// preallocate...
|
||||||
|
}
|
||||||
}
|
}
|
||||||
```
|
```
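The capacity mismatch behind Layer 0 can be illustrated with a minimal sketch. It assumes a per-class free list that rejects pushes once it holds 128 blocks. The names `sll_t`, `sll_push`, and `prealloc_class` are hypothetical and are not HAKMEM's actual API. The point is only that a preallocation loop has to stop, and release the rejected block, when the list is full.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SLL_CAPACITY 128u  /* assumed per-class limit, mirroring the 128 above */

/* Hypothetical intrusive free list: each free block stores the next pointer. */
typedef struct sll {
    void    *head;
    uint32_t count;
} sll_t;

/* Push succeeds only while the list is below capacity. */
static int sll_push(sll_t *s, void *block) {
    if (s->count >= SLL_CAPACITY) return 0;   /* full: caller keeps ownership */
    *(void **)block = s->head;
    s->head = block;
    s->count++;
    return 1;
}

/* Try to preallocate `want` blocks; returns how many were actually stored. */
static uint32_t prealloc_class(sll_t *s, size_t block_size, uint32_t want) {
    assert(block_size >= sizeof(void *));     /* block must hold the link */
    uint32_t stored = 0;
    for (uint32_t i = 0; i < want; i++) {
        void *b = malloc(block_size);
        if (!b) break;
        if (!sll_push(s, b)) {                /* rejected block must not be lost */
            free(b);
            break;
        }
        stored++;
    }
    return stored;
}
```

Under these assumptions, requesting 50,000 blocks stores only 128. A loop that ignored the push result would keep allocating blocks that nothing tracks.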
|
||||||
|
|
||||||
### Why This Works
|
### Layer 1: Design Misunderstanding Fix
|
||||||
|
|
||||||
1. **Branch Prediction**: CPU now expects unified path (not legacy path)
|
**File**: `core/box/bench_fast_box.c`
|
||||||
2. **Cache Locality**: Unified path stays hot in instruction cache
|
**Lines**: 123-128 (REMOVED)
|
||||||
3. **Code Layout**: Compiler places unified path inline (legacy path cold)
|
|
||||||
4. **perf Data**: malloc/free consumed 43% CPU → optimized to hot path
|
|
||||||
|
|
||||||
### Phase 7-Step2 (PGO Mode)
|
**Root Cause**:
|
||||||
|
- BenchFast uses TLS SLL directly (g_tls_sll[])
|
||||||
|
- Unified Cache is NOT used by BenchFast
|
||||||
|
- unified_cache_init() created 16KB allocations (infrastructure)
|
||||||
|
- Later freed by BenchFast → header misclassification → CRASH
|
||||||
|
|
||||||
**File**: `Makefile`
|
**Fix**:
|
||||||
**Line**: 606
|
|
||||||
|
|
||||||
```make
|
|
||||||
# Added -DHAKMEM_TINY_FRONT_PGO=1 for bench builds
|
|
||||||
bench_random_mixed_hakmem.o: bench_random_mixed.c hakmem.h
|
|
||||||
$(CC) $(CFLAGS) -DUSE_HAKMEM -DHAKMEM_TINY_FRONT_PGO=1 -c -o $@ $<
|
|
||||||
```
|
|
||||||
|
|
||||||
**Effect**: `TINY_FRONT_UNIFIED_GATE_ENABLED = 1` (compile-time constant)
|
|
||||||
- Enables dead code elimination: `if (1) { ... }` → always taken
|
|
||||||
- No performance change (Step 1 already optimized path)
|
|
||||||
- Code quality improvement (foundation for Step 3-7)
|
|
||||||
|
|
||||||
### Phase 7-Step3 (Config Box Integration)
|
|
||||||
|
|
||||||
**File**: `core/tiny_alloc_fast.inc.h`
|
|
||||||
**Lines**: 25 (include), 33-41 (wrapper functions)
|
|
||||||
|
|
||||||
**Changes**:
|
|
||||||
1. Include `box/tiny_front_config_box.h` - Dual-mode configuration infrastructure
|
|
||||||
2. Add wrapper functions for missing config macros:
|
|
||||||
```c
|
|
||||||
static inline int tiny_fastcache_enabled(void) {
|
|
||||||
extern int g_fastcache_enable;
|
|
||||||
return g_fastcache_enable;
|
|
||||||
}
|
|
||||||
|
|
||||||
static inline int sfc_cascade_enabled(void) {
|
|
||||||
extern int g_sfc_enabled;
|
|
||||||
return g_sfc_enabled;
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Effect**: Dead code elimination infrastructure in place
|
|
||||||
- Normal mode: Config macros → runtime function calls (backward compatible)
|
|
||||||
- PGO mode: Config macros → compile-time constants (dead code elimination)
|
|
||||||
- No performance change (infrastructure only, not used yet)
|
|
||||||
- Foundation for Steps 4-7 (replace runtime checks with macros)
|
|
||||||
|
|
||||||
**Config Box Dual-Mode Design**:
|
|
||||||
```c
|
```c
|
||||||
// PGO Mode (-DHAKMEM_TINY_FRONT_PGO=1):
|
// REMOVED:
|
||||||
#define TINY_FRONT_FASTCACHE_ENABLED 0 // Compile-time constant
|
// unified_cache_init(); // WRONG! BenchFast uses TLS SLL, not Unified Cache
|
||||||
#define TINY_FRONT_HEAP_V2_ENABLED 0 // Compile-time constant
|
|
||||||
#define TINY_FRONT_ULTRA_SLIM_ENABLED 0 // Compile-time constant
|
|
||||||
|
|
||||||
// Normal Mode (default):
|
// Added comment:
|
||||||
#define TINY_FRONT_FASTCACHE_ENABLED tiny_fastcache_enabled() // Runtime check
|
// Phase 8 Root Cause Fix: REMOVED unified_cache_init() call
|
||||||
#define TINY_FRONT_HEAP_V2_ENABLED tiny_heap_v2_enabled() // Runtime check
|
// Reason: BenchFast uses TLS SLL directly, NOT Unified Cache
|
||||||
#define TINY_FRONT_ULTRA_SLIM_ENABLED ultra_slim_mode_enabled() // Runtime check
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### Phase 7-Step4 (Macro Replacement)
|
### Layer 2: Infrastructure Isolation
|
||||||
|
|
||||||
**File**: `core/tiny_alloc_fast.inc.h`
|
**File**: `core/front/tiny_unified_cache.c`
|
||||||
**Lines**: 421, 757, 809 (3 hot path checks)
|
**Lines**: 61-71 (init), 103-109 (shutdown)
|
||||||
|
|
||||||
**Changes**:
|
**Strategy**: Dual-Path Separation
|
||||||
Replace runtime checks with config macros for dead code elimination:
|
- **Workload allocations** (measured): HAKMEM paths (TLS SLL, Unified Cache)
|
||||||
|
- **Infrastructure allocations** (unmeasured): __libc_calloc/__libc_free
|
||||||
|
|
||||||
|
**Fix**:
|
||||||
```c
|
```c
|
||||||
// Line 421: FastCache check
|
|
||||||
// Before:
|
// Before:
|
||||||
if (__builtin_expect(g_fastcache_enable && class_idx <= 3, 1)) {
|
g_unified_cache[cls].slots = (void**)calloc(cap, sizeof(void*));
|
||||||
// After:
|
|
||||||
if (__builtin_expect(TINY_FRONT_FASTCACHE_ENABLED && class_idx <= 3, 1)) {
|
|
||||||
|
|
||||||
// Line 809: Heap V2 check
|
|
||||||
// Before:
|
|
||||||
if (__builtin_expect(tiny_heap_v2_enabled() && front_prune_heapv2_enabled() && class_idx <= 3, 0)) {
|
|
||||||
// After:
|
// After:
|
||||||
if (__builtin_expect(TINY_FRONT_HEAP_V2_ENABLED && front_prune_heapv2_enabled() && class_idx <= 3, 0)) {
|
extern void* __libc_calloc(size_t, size_t);
|
||||||
|
g_unified_cache[cls].slots = (void**)__libc_calloc(cap, sizeof(void*));
|
||||||
// Line 757: Ultra SLIM check
|
|
||||||
// Before:
|
|
||||||
if (__builtin_expect(ultra_slim_mode_enabled(), 0)) {
|
|
||||||
// After:
|
|
||||||
if (__builtin_expect(TINY_FRONT_ULTRA_SLIM_ENABLED, 0)) {
|
|
||||||
```
|
```
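The dual-path separation can be sketched as follows. This is an illustration under stated assumptions, not the project's code: `workload_alloc`, `infra_calloc`, `infra_free`, and `cache_t` are hypothetical names, and `infra_calloc` forwards to plain `calloc` as a portable stand-in for the glibc-specific `__libc_calloc` used in the fix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Counters standing in for "what the benchmark measures". */
static size_t g_workload_allocs = 0;
static size_t g_infra_allocs    = 0;

/* Workload path (hypothetical): in HAKMEM this would be TLS SLL / Unified Cache. */
static void *workload_alloc(size_t n) {
    g_workload_allocs++;
    return malloc(n);
}

/* Infrastructure path: stand-in for __libc_calloc. It bypasses the workload
 * path, so bookkeeping arrays are never counted or routed as workload blocks. */
static void *infra_calloc(size_t nmemb, size_t size) {
    g_infra_allocs++;
    return calloc(nmemb, size);
}

/* Symmetric release: memory from the infra path goes back the same way. */
static void infra_free(void *p) { free(p); }

typedef struct { void **slots; size_t cap; } cache_t;

static int cache_init(cache_t *c, size_t cap) {
    assert(c != NULL);
    c->slots = (void **)infra_calloc(cap, sizeof(void *));
    c->cap   = c->slots ? cap : 0;
    return c->slots != NULL;
}

static void cache_shutdown(cache_t *c) {
    infra_free(c->slots);
    c->slots = NULL;
    c->cap   = 0;
}
```

The design choice shown here is symmetry: whatever allocates the slot array also frees it, so the array never reaches a free path that inspects workload headers.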
|
||||||
|
|
||||||
**Effect**: Dead code elimination in PGO mode
|
### Layer 3: Box Contract Documentation
|
||||||
- PGO mode (`-DHAKMEM_TINY_FRONT_PGO=1`):
|
|
||||||
- `if (0 && ...) { ... }` → entire block removed by compiler
|
|
||||||
- Smaller code size, better instruction cache locality
|
|
||||||
- Fewer branches in hot path
|
|
||||||
- Normal mode (default):
|
|
||||||
- `if (g_fastcache_enable && ...) { ... }` → runtime check preserved
|
|
||||||
- Full backward compatibility with ENV variables
|
|
||||||
|
|
||||||
**Performance Impact**:
|
**File**: `core/box/bench_fast_box.h`
|
||||||
- Before: 80.6 M ops/s (Phase 7-Step3)
|
**Lines**: 13-51
|
||||||
- After: 81.0 / 81.0 / 82.4 M ops/s (3 runs)
|
|
||||||
- Average: 81.5 M ops/s (+1.1%, +0.9 M ops/s)
|
|
||||||
|
|
||||||
**Dead Code Eliminated**:
|
**Added Documentation**:
|
||||||
1. FastCache path (C0-C3): `fastcache_pop()` call + hit/miss tracking
|
- BenchFast uses TLS SLL, NOT Unified Cache
|
||||||
2. Heap V2 path: `tiny_heap_v2_alloc_by_class()` + metrics
|
- Scope separation (workload vs infrastructure)
|
||||||
3. Ultra SLIM path: `ultra_slim_alloc_with_refill()` early return
|
- Preconditions and guarantees
|
||||||
|
- Contract violation example (Phase 8 bug)
|
||||||
|
|
||||||
|
### TLS→Atomic Fix
|
||||||
|
|
||||||
|
**File**: `core/box/bench_fast_box.c`
|
||||||
|
**Lines**: 22-27 (declaration), 37, 124, 215 (usage)
|
||||||
|
|
||||||
|
**Root Cause**:
|
||||||
|
```
|
||||||
|
pthread_once() → init routine runs on one calling thread
Every other thread has its own TLS copy (bench_fast_init_in_progress = 0)
|
||||||
|
Guard broken → getenv() allocates → freed by __libc_free() → CRASH
|
||||||
|
```
|
||||||
|
|
||||||
|
**Fix**:
|
||||||
|
```c
|
||||||
|
// Before (TLS - broken):
|
||||||
|
__thread int bench_fast_init_in_progress = 0;
|
||||||
|
if (__builtin_expect(bench_fast_init_in_progress, 0)) { ... }
|
||||||
|
|
||||||
|
// After (Atomic - fixed):
|
||||||
|
atomic_int g_bench_fast_init_in_progress = 0;
|
||||||
|
if (__builtin_expect(atomic_load(&g_bench_fast_init_in_progress), 0)) { ... }
|
||||||
|
```
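The visibility difference between the two guard types can be shown in isolation. The sketch below does not reproduce the BenchFast init path or `pthread_once()`. It only sets a thread-local flag and an atomic flag on one thread and reads both from a second thread. All names (`tls_guard`, `shared_guard`, `observe`) are illustrative assumptions.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static _Thread_local int tls_guard = 0;     /* one copy per thread */
static atomic_int        shared_guard = 0;  /* one copy per process */

typedef struct { int tls_seen; int atomic_seen; } observation_t;

/* Runs on a second thread and records what each guard looks like from there. */
static void *observe(void *arg) {
    observation_t *o = (observation_t *)arg;
    o->tls_seen    = tls_guard;
    o->atomic_seen = atomic_load(&shared_guard);
    return NULL;
}

/* Sets both guards on the calling thread, then reads them from another thread. */
static observation_t set_then_observe_from_other_thread(void) {
    observation_t o = { -1, -1 };
    pthread_t t;

    tls_guard = 1;
    atomic_store(&shared_guard, 1);

    int rc = pthread_create(&t, NULL, observe, &o);
    assert(rc == 0);
    (void)rc;
    pthread_join(t, NULL);

    /* Reset so the helper can be called again. */
    tls_guard = 0;
    atomic_store(&shared_guard, 0);
    return o;
}
```

The second thread sees its own zero-initialized copy of `tls_guard` but sees the value stored to `shared_guard`, which is the property the guard needs.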
|
||||||
|
|
||||||
|
**Box Theory (箱理論) Validation**:
|
||||||
|
- **Responsibility**: Guard must protect entire process (not per-thread)
|
||||||
|
- **Contract**: "No BenchFast allocations during init" (all threads)
|
||||||
|
- **Observable**: Atomic variable visible across all threads
|
||||||
|
- **Composable**: Works with pthread_once() threading model
|
||||||
|
|
||||||
|
### Header Write Fix
|
||||||
|
|
||||||
|
**File**: `core/box/bench_fast_box.c`
|
||||||
|
**Lines**: 70-80
|
||||||
|
|
||||||
|
**Root Cause**:
|
||||||
|
- P3 optimization: tiny_region_id_write_header() skips header writes by default
|
||||||
|
- BenchFast free routing checks header magic (0xa0-0xa7)
|
||||||
|
- No header → free() misroutes to __libc_free() → CRASH
|
||||||
|
|
||||||
|
**Fix**:
|
||||||
|
```c
|
||||||
|
// Before (broken - calls function that skips write):
|
||||||
|
tiny_region_id_write_header(base, class_idx);
|
||||||
|
return (void*)((char*)base + 1);
|
||||||
|
|
||||||
|
// After (fixed - direct write):
|
||||||
|
*(uint8_t*)base = (uint8_t)(0xa0 | (class_idx & 0x0f)); // Direct write
|
||||||
|
return (void*)((char*)base + 1);
|
||||||
|
```
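How a one-byte header drives free routing can be sketched as below. It assumes only what the notes state: the header sits one byte before the user pointer and holds `0xa0 | class_idx` for classes 0 to 7. The function names `write_header` and `route_free` are hypothetical, and the real routing logic may check more than this.

```c
#include <assert.h>
#include <stdint.h>

#define HDR_MAGIC      0xa0u
#define HDR_MAGIC_MASK 0xf0u
#define HDR_CLASS_MASK 0x0fu
#define MAX_CLASS      7u

/* Writes the 1-byte header at `base` and returns the user pointer after it. */
static void *write_header(void *base, unsigned class_idx) {
    assert(class_idx <= MAX_CLASS);
    *(uint8_t *)base = (uint8_t)(HDR_MAGIC | (class_idx & HDR_CLASS_MASK));
    return (uint8_t *)base + 1;
}

/* Returns the class index if the byte before `ptr` carries the magic,
 * or -1 if the block would have to be routed elsewhere (e.g. to libc). */
static int route_free(const void *ptr) {
    uint8_t hdr = *((const uint8_t *)ptr - 1);
    if ((hdr & HDR_MAGIC_MASK) != HDR_MAGIC) return -1;
    unsigned cls = hdr & HDR_CLASS_MASK;
    return cls <= MAX_CLASS ? (int)cls : -1;
}
```

If the header write is skipped, the byte before the user pointer holds whatever was there before, and `route_free` falls through to the `-1` branch. That corresponds to the misrouting described in the root cause.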
|
||||||
|
|
||||||
|
**Contract**: BenchFast always writes headers (required for free routing)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Next Phase Options (from Task Agent Plan)
|
## Next Phase Options
|
||||||
|
|
||||||
### Option A: Continue Phase 7 (Steps 3-7) 📦
|
### Option A: Continue Phase 7 (Steps 5-7) 📦
|
||||||
**Goal**: Remove legacy layers (FastCache/SFC/HeapV2/TLS SLL)
|
**Goal**: Remove remaining legacy layers (complete dead code elimination)
|
||||||
**Expected**: Additional +5-10% via dead code elimination
|
**Expected**: Additional +3-5% via further code cleanup
|
||||||
**Duration**: 2-3 days (systematic removal)
|
**Duration**: 1-2 days
|
||||||
**Risk**: Medium (might break backward compatibility)
|
**Risk**: Low (infrastructure already in place)
|
||||||
|
|
||||||
**Completed Steps**:
|
**Remaining Steps**:
|
||||||
- ✅ Step 3: Config box integration (infrastructure ready)
|
|
||||||
|
|
||||||
**Remaining Steps** (from Task agent, updated):
|
|
||||||
- Step 4: Replace runtime checks with config macros in hot path (~20 lines)
|
|
||||||
- Replace `g_fastcache_enable` → `TINY_FRONT_FASTCACHE_ENABLED`
|
|
||||||
- Replace `tiny_heap_v2_enabled()` → `TINY_FRONT_HEAP_V2_ENABLED`
|
|
||||||
- Replace `ultra_slim_mode_enabled()` → `TINY_FRONT_ULTRA_SLIM_ENABLED`
|
|
||||||
- Step 5: Compile library with PGO flag (Makefile change)
|
- Step 5: Compile library with PGO flag (Makefile change)
|
||||||
- Step 6: Verify dead code elimination in assembly
|
- Step 6: Verify dead code elimination in assembly
|
||||||
- Step 7: Measure performance improvement (+5-10% expected)
|
- Step 7: Measure performance improvement
|
||||||
|
|
||||||
**Total**: ~20 lines of code changes + Makefile update
|
### Option B: PGO Re-enablement 🚀
|
||||||
|
|
||||||
### Option B: Investigate Phase 5 Regression 🔍
|
|
||||||
**Goal**: Understand -8.6% regression (57.2M → 52.3M before Phase 7)
|
|
||||||
**Note**: Now irrelevant (Phase 7 exceeded Phase 4 performance!)
|
|
||||||
**Status**: ✅ RESOLVED by Phase 7 (+54.2% masks the -8.6%)
|
|
||||||
|
|
||||||
### Option C: PGO Re-enablement 🚀
|
|
||||||
**Goal**: Re-enable PGO workflow from Phase 4-Step1
|
**Goal**: Re-enable PGO workflow from Phase 4-Step1
|
||||||
**Expected**: +6-13% cumulative (on top of 80.6M)
|
**Expected**: +6-13% cumulative (on top of 81.5M)
|
||||||
**Duration**: 2-3 days (resolve build issues)
|
**Duration**: 2-3 days
|
||||||
**Risk**: Low (proven pattern)
|
**Risk**: Low (proven pattern)
|
||||||
|
|
||||||
**Phase 4 PGO Results** (reference):
|
|
||||||
- Before: 57.0 M ops/s
|
|
||||||
- After PGO: 60.6 M ops/s (+6.25%)
|
|
||||||
|
|
||||||
**Current projection**:
|
**Current projection**:
|
||||||
- Phase 7 baseline: 80.6 M ops/s
|
- Phase 7 baseline: 81.5 M ops/s
|
||||||
- With PGO: ~85-91 M ops/s (+6-13%)
|
- With PGO: ~86-92 M ops/s (+6-13%)
|
||||||
|
|
||||||
|
### Option C: BenchFast Pool Expansion 🏎️
|
||||||
|
**Goal**: Increase BenchFast pool size for full 10M iteration support
|
||||||
|
**Expected**: Structural ceiling measurement (30-40M ops/s target)
|
||||||
|
**Duration**: 1 day
|
||||||
|
**Risk**: Low (just increase prealloc count)
|
||||||
|
|
||||||
|
**Current status**:
|
||||||
|
- Pool: 128 blocks/class (768 total)
|
||||||
|
- Exhaustion: C6/C7 exhaust after ~200 iterations
|
||||||
|
- Need: ~10,000 blocks/class for 10M iterations (60,000 total)
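The pool totals above follow from the class range used by the prealloc loop (classes 2 through 7, six classes). A small sketch of that arithmetic, with `total_pool_blocks` as a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

#define FIRST_CLASS 2   /* classes 2..7, as in the prealloc loop */
#define LAST_CLASS  7

/* Total blocks preallocated across all BenchFast classes. */
static uint32_t total_pool_blocks(uint32_t blocks_per_class) {
    uint32_t classes = (uint32_t)(LAST_CLASS - FIRST_CLASS + 1);
    assert(blocks_per_class <= UINT32_MAX / classes);  /* no overflow */
    return classes * blocks_per_class;
}
```

With 128 blocks per class this gives 768, and with 10,000 blocks per class it gives 60,000, matching the figures in the list.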
|
||||||
|
|
||||||
### Option D: Production Readiness 📊
|
### Option D: Production Readiness 📊
|
||||||
**Goal**: Comprehensive benchmark suite, deployment guide
|
**Goal**: Comprehensive benchmark suite, deployment guide
|
||||||
@@ -235,72 +283,61 @@ if (__builtin_expect(TINY_FRONT_ULTRA_SLIM_ENABLED, 0)) {
|
|||||||
**Duration**: 3-5 days
|
**Duration**: 3-5 days
|
||||||
**Risk**: Low (documentation + testing)
|
**Risk**: Low (documentation + testing)
|
||||||
|
|
||||||
### Option E: Multi-threaded Optimization 🔀
|
|
||||||
**Goal**: Optimize for multi-threaded workloads
|
|
||||||
**Expected**: Improved MT scalability
|
|
||||||
**Duration**: 4-6 days (need MT benchmarks first)
|
|
||||||
**Risk**: High (no MT benchmark exists yet)
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Recommendation
|
## Recommendation
|
||||||
|
|
||||||
### Top Pick: **Option C (PGO Re-enablement)** 🚀
|
### Top Pick: **Option C (BenchFast Pool Expansion)** 🏎️
|
||||||
|
|
||||||
**Reasoning**:
|
**Reasoning**:
|
||||||
1. **Phase 7 success**: 80.6M ops/s is excellent baseline for PGO
|
1. **Phase 8 fixes working**: TLS→Atomic + Header write proven
|
||||||
2. **Known benefit**: +6.25% proven in Phase 4-Step1
|
2. **Quick win**: Just increase ACTUAL_TLS_SLL_CAPACITY to 10,000
|
||||||
3. **Low risk**: Just fix build issue (`__gcov_merge_time_profile` error)
|
3. **Scientific value**: Measure true structural ceiling (no safety costs)
|
||||||
4. **Quick win**: 2-3 days vs 2-3 days for Phase 7-Step3+
|
4. **Low risk**: 1-day task, single-constant change (capacity tuning only)
|
||||||
5. **Cumulative**: Would stack with current 80.6M baseline
|
5. **Data-driven**: Enables comparison vs normal mode (16.3M vs 30-40M expected)
|
||||||
|
|
||||||
**Expected Result**:
|
**Expected Result**:
|
||||||
```
|
```
|
||||||
Phase 7 baseline: 80.6 M ops/s
|
Normal mode: 16.3 M ops/s (current)
|
||||||
With PGO: ~85-91 M ops/s (+6-13%)
|
BenchFast mode: 30-40 M ops/s (target, 1.8-2.5x faster)
|
||||||
```
|
```
|
||||||
|
|
||||||
**Fallback**: If PGO fix takes >3 days, switch to Option A (Phase 7-Step3+)
|
**Implementation**:
|
||||||
|
```c
|
||||||
|
// core/box/bench_fast_box.c:140
|
||||||
|
const uint32_t ACTUAL_TLS_SLL_CAPACITY = 10000; // Was 128
|
||||||
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### Second Choice: **Option A (Continue Phase 7-Step3+)** 📦
|
### Second Choice: **Option B (PGO Re-enablement)** 🚀
|
||||||
|
|
||||||
**Reasoning**:
|
**Reasoning**:
|
||||||
1. **Momentum**: Phase 7-Step1+2 already done, Step 3-7 is natural continuation
|
1. **Proven benefit**: +6.25% in Phase 4-Step1
|
||||||
2. **Clear path**: Task agent provided detailed 5-step plan
|
2. **Cumulative**: Would stack with Phase 7 (81.5M baseline)
|
||||||
3. **Predictable**: Expected +5-10% additional improvement
|
3. **Low risk**: Just fix build issue
|
||||||
4. **Code cleanup**: Removes legacy layers (FastCache/SFC/HeapV2)
|
4. **High impact**: ~86-92 M ops/s projected
|
||||||
|
|
||||||
**Expected Result**:
|
|
||||||
```
|
|
||||||
Phase 7-Step1+2: 80.6 M ops/s
|
|
||||||
Phase 7-Step3-7: ~84-89 M ops/s (+5-10%)
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Current Performance Summary
|
## Current Performance Summary
|
||||||
|
|
||||||
### bench_random_mixed (16B-1KB, Tiny workload, ws=256)
|
### bench_random_mixed (16B-1KB, Tiny workload)
|
||||||
```
|
```
|
||||||
Phase 3 (mincore removal): 56.8 M ops/s
|
Phase 7-Step4 (ws=256): 81.5 M ops/s (+55.5% total)
|
||||||
Phase 4 (Hot/Cold Box): 57.2 M ops/s (+0.7%)
|
Phase 8 (ws=8192): 16.3 M ops/s (normal mode, working)
|
||||||
Phase 5 (Mid MT fix): 52.3 M ops/s (-8.6%)
|
|
||||||
Phase 7 (Unified front): 80.6 M ops/s (+54.2%!) ⭐
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### bench_mid_mt_gap (1KB-8KB, Mid MT workload, ws=256)
|
### bench_mid_mt_gap (1KB-8KB, Mid MT workload, ws=256)
|
||||||
```
|
```
|
||||||
Before Phase 5 (broken): 1.49 M ops/s
|
|
||||||
After Phase 5 (fixed): 41.0 M ops/s (+28.9x)
|
|
||||||
After Phase 6-B (lock-free): 42.09 M ops/s (+2.65%)
|
After Phase 6-B (lock-free): 42.09 M ops/s (+2.65%)
|
||||||
vs System malloc: 26.8 M ops/s (1.57x faster)
|
vs System malloc: 26.8 M ops/s (1.57x faster)
|
||||||
```
|
```
|
||||||
|
|
||||||
### Overall Status
|
### Overall Status
|
||||||
- ✅ **Tiny allocations** (16B-1KB): **80.6 M ops/s** (excellent, +54.2% vs Phase 5!)
|
- ✅ **Tiny allocations** (16B-1KB): **81.5 M ops/s** (excellent, +55.5%!)
|
||||||
- ✅ **Mid MT allocations** (1KB-8KB): 42 M ops/s (excellent, 1.57x vs system, lock-free)
|
- ✅ **Mid MT allocations** (1KB-8KB): 42 M ops/s (excellent, 1.57x vs system)
|
||||||
|
- ✅ **BenchFast mode**: No crash (TLS→Atomic + Header fix working)
|
||||||
- ⏸️ **Large allocations** (32KB-2MB): Not benchmarked yet
|
- ⏸️ **Large allocations** (32KB-2MB): Not benchmarked yet
|
||||||
- ⏸️ **MT workloads**: No MT benchmarks yet
|
- ⏸️ **MT workloads**: No MT benchmarks yet
|
||||||
|
|
||||||
@@ -309,17 +346,16 @@ vs System malloc: 26.8 M ops/s (1.57x faster)
|
|||||||
## Decision Time
|
## Decision Time
|
||||||
|
|
||||||
**Choose your next phase**:
|
**Choose your next phase**:
|
||||||
- **Option A**: Continue Phase 7 (Steps 3-7, legacy removal)
|
- **Option A**: Continue Phase 7 (Steps 5-7, final cleanup)
|
||||||
- **Option B**: ~~Investigate regression~~ (RESOLVED by Phase 7)
|
- **Option B**: PGO re-enablement (recommended for normal builds)
|
||||||
- **Option C**: PGO re-enablement (recommended)
|
- **Option C**: BenchFast pool expansion (recommended for ceiling measurement)
|
||||||
- **Option D**: Production readiness & benchmarking
|
- **Option D**: Production readiness & benchmarking
|
||||||
- **Option E**: Multi-threaded optimization
|
|
||||||
|
|
||||||
**Or**: Celebrate Phase 7 success! 🎉 (+54.2% is huge!)
|
**Or**: Celebrate Phase 8 success! 🎉 (Root cause fixes complete!)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
Updated: 2025-11-29
|
Updated: 2025-11-30
|
||||||
Phase: 7 COMPLETE (Step 1-2) → 8 PENDING
|
Phase: 8 COMPLETE (Root Cause Fixes) → 9 PENDING
|
||||||
Previous: Phase 6 (Lock-free Mid MT, +2.65%)
|
Previous: Phase 7 (Tiny Front Unification, +55.5%)
|
||||||
Achievement: Tiny Front Unification (80.6M ops/s, +54.2% improvement!)
|
Achievement: BenchFast crash investigation and fixes (Box Theory/箱理論 root cause analysis!)
|
||||||
|
|||||||