diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md
index f118edab..bcc5888c 100644
--- a/CURRENT_TASK.md
+++ b/CURRENT_TASK.md
@@ -1,21 +1,22 @@
 # Current Task: Phase 7 Complete - Next Steps
 
 **Date**: 2025-11-29
-**Status**: Phase 7 ✅ COMPLETE (Step 1-3)
-**Achievement**: Tiny Front Hot Path Unification (+54.2% improvement!)
+**Status**: Phase 7 ✅ COMPLETE (Step 1-4)
+**Achievement**: Tiny Front Hot Path Unification + Dead Code Elimination (+55.5% total!)
 
 ---
 
 ## Phase 7 Complete! ✅
 
-**Result**: Tiny Front Hot Path Unification **COMPLETE** (Step 1-3)
-**Performance**: 52.3M → 80.6M ops/s (+54.2% improvement, +28.3M ops/s)
+**Result**: Tiny Front Hot Path Unification **COMPLETE** (Step 1-4)
+**Performance**: 52.3M → 81.5M ops/s (+55.5% improvement, +29.2M ops/s)
 **Duration**: <1 day (extremely quick win!)
 
 **Completed Steps**:
 - ✅ Step 1: Branch hint reversal (0→1) - **+54.2% improvement**
 - ✅ Step 2: Compile-time unified gate (PGO mode) - Code quality improvement
 - ✅ Step 3: Config box integration - Dead code elimination infrastructure
+- ✅ Step 4: Macro replacement in hot path - **+1.1% additional improvement**
 
 **Key Discovery** (from ChatGPT + Task agent analysis):
 - Unified fast path existed but was marked UNLIKELY (`__builtin_expect(..., 0)`)
@@ -34,9 +35,10 @@ Phase 3 (mincore removal): 56.8 M ops/s
 Phase 4 (Hot/Cold Box): 57.2 M ops/s (+0.7%)
 Phase 5 (Mid MT fix): 52.3 M ops/s (-8.6% regression)
 Phase 6 (Lock-free Mid MT): 42.1 M ops/s (Mid MT: +2.65%)
-Phase 7 (Unified front): 80.6 M ops/s (+54.2%!) ⭐
+Phase 7-Step1 (Unified front): 80.6 M ops/s (+54.2%!) ⭐
+Phase 7-Step4 (Dead code): 81.5 M ops/s (+1.1%) ⭐⭐
 
-Total improvement: +41.9% (56.8M → 80.6M) from Phase 3
+Total improvement: +43.5% (56.8M → 81.5M) from Phase 3
 ```
 
 ### Benchmark Results Summary
@@ -46,6 +48,7 @@ Total improvement: +41.9% (56.8M → 80.6M) from Phase 3
 Phase 7-Step1 (branch hint): 80.6 M ops/s (+54.2%)
 Phase 7-Step2 (PGO mode): 80.3 M ops/s (-0.37%, noise)
 Phase 7-Step3 (config box): 80.6 M ops/s (+0.37%, noise)
+Phase 7-Step4 (macros): 81.5 M ops/s (+1.1%, dead code elimination!)
 ```
 
 **bench_mid_mt_gap (1KB-8KB, Mid MT workload, ws=256)**:
@@ -136,6 +139,53 @@ bench_random_mixed_hakmem.o: bench_random_mixed.c hakmem.h
 #define TINY_FRONT_ULTRA_SLIM_ENABLED ultra_slim_mode_enabled() // Runtime check
 ```
 
+### Phase 7-Step4 (Macro Replacement)
+
+**File**: `core/tiny_alloc_fast.inc.h`
+**Lines**: 421, 757, 809 (3 hot path checks)
+
+**Changes**:
+Replace runtime checks with config macros for dead code elimination:
+
+```c
+// Line 421: FastCache check
+// Before:
+if (__builtin_expect(g_fastcache_enable && class_idx <= 3, 1)) {
+// After:
+if (__builtin_expect(TINY_FRONT_FASTCACHE_ENABLED && class_idx <= 3, 1)) {
+
+// Line 809: Heap V2 check
+// Before:
+if (__builtin_expect(tiny_heap_v2_enabled() && front_prune_heapv2_enabled() && class_idx <= 3, 0)) {
+// After:
+if (__builtin_expect(TINY_FRONT_HEAP_V2_ENABLED && front_prune_heapv2_enabled() && class_idx <= 3, 0)) {
+
+// Line 757: Ultra SLIM check
+// Before:
+if (__builtin_expect(ultra_slim_mode_enabled(), 0)) {
+// After:
+if (__builtin_expect(TINY_FRONT_ULTRA_SLIM_ENABLED, 0)) {
+```
+
+**Effect**: Dead code elimination in PGO mode
+- PGO mode (`-DHAKMEM_TINY_FRONT_PGO=1`):
+  - `if (0 && ...) { ... }` → entire block removed by compiler
+  - Smaller code size, better instruction cache locality
+  - Fewer branches in hot path
+- Normal mode (default):
+  - `if (g_fastcache_enable && ...) { ... }` → runtime check preserved
+  - Full backward compatibility with ENV variables
+
+**Performance Impact**:
+- Before: 80.6 M ops/s (Phase 7-Step3)
+- After: 81.0 / 81.0 / 82.4 M ops/s (3 runs)
+- Average: 81.5 M ops/s (+1.1%, +0.9 M ops/s)
+
+**Dead Code Eliminated**:
+1. FastCache path (C0-C3): `fastcache_pop()` call + hit/miss tracking
+2. Heap V2 path: `tiny_heap_v2_alloc_by_class()` + metrics
+3. Ultra SLIM path: `ultra_slim_alloc_with_refill()` early return
+
 ---
 
 ## Next Phase Options (from Task Agent Plan)