Update CURRENT_TASK.md: Phase 7-Step4 complete (+55.5% total improvement!)
**Updated**:
- Status: Phase 7 Step 1-3 → Step 1-4 (complete)
- Achievement: +54.2% → +55.5% total (+1.1% from Step 4)
- Performance: 52.3M → 81.5M ops/s (+29.2M ops/s total)

**Phase 7-Step4 Summary**:
- Replace 3 runtime checks with config macros in hot path
- Dead code elimination in PGO mode (bench builds)
- Performance: 80.6M → 81.5M ops/s (+1.1%, +0.9M ops/s)

**Macro Replacements**:
1. `g_fastcache_enable` → `TINY_FRONT_FASTCACHE_ENABLED` (line 421)
2. `tiny_heap_v2_enabled()` → `TINY_FRONT_HEAP_V2_ENABLED` (line 809)
3. `ultra_slim_mode_enabled()` → `TINY_FRONT_ULTRA_SLIM_ENABLED` (line 757)

**Dead Code Eliminated** (PGO mode):
- FastCache path: fastcache_pop() + hit/miss tracking
- Heap V2 path: tiny_heap_v2_alloc_by_class() + metrics
- Ultra SLIM path: ultra_slim_alloc_with_refill() early return

**Cumulative Phase 7 Results**:
- Step 1: Branch hint reversal (+54.2%)
- Step 2: PGO mode infrastructure (neutral)
- Step 3: Config box integration (neutral)
- Step 4: Macro replacement (+1.1%)
- **Total: +55.5% improvement (52.3M → 81.5M ops/s)**

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -1,21 +1,22 @@
 # Current Task: Phase 7 Complete - Next Steps

 **Date**: 2025-11-29
-**Status**: Phase 7 ✅ COMPLETE (Step 1-3)
-**Achievement**: Tiny Front Hot Path Unification (+54.2% improvement!)
+**Status**: Phase 7 ✅ COMPLETE (Step 1-4)
+**Achievement**: Tiny Front Hot Path Unification + Dead Code Elimination (+55.5% total!)

 ---

 ## Phase 7 Complete! ✅

-**Result**: Tiny Front Hot Path Unification **COMPLETE** (Step 1-3)
-**Performance**: 52.3M → 80.6M ops/s (+54.2% improvement, +28.3M ops/s)
+**Result**: Tiny Front Hot Path Unification **COMPLETE** (Step 1-4)
+**Performance**: 52.3M → 81.5M ops/s (+55.5% improvement, +29.2M ops/s)
 **Duration**: <1 day (extremely quick win!)

 **Completed Steps**:
 - ✅ Step 1: Branch hint reversal (0→1) - **+54.2% improvement**
 - ✅ Step 2: Compile-time unified gate (PGO mode) - Code quality improvement
 - ✅ Step 3: Config box integration - Dead code elimination infrastructure
+- ✅ Step 4: Macro replacement in hot path - **+1.1% additional improvement**

 **Key Discovery** (from ChatGPT + Task agent analysis):
 - Unified fast path existed but was marked UNLIKELY (`__builtin_expect(..., 0)`)
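The branch hint reversal named in the key discovery can be illustrated with a minimal sketch. `__builtin_expect` is the GCC/Clang builtin that the diff itself quotes; everything else here (the `demo_route` enum, the `demo_*` functions, and the `unified_enabled` parameter) is a hypothetical stand-in, because the real condition is elided as `...` in the source.

```c
#include <assert.h>

/* Hypothetical result codes for the two allocation routes. */
enum demo_route { DEMO_ROUTE_LEGACY = 0, DEMO_ROUTE_UNIFIED = 1 };

/* Before: the unified path is hinted as unlikely (expected value 0), so the
 * compiler is free to lay it out away from the fall-through code. */
static enum demo_route demo_route_before(int unified_enabled) {
    if (__builtin_expect(unified_enabled != 0, 0)) {
        return DEMO_ROUTE_UNIFIED;
    }
    return DEMO_ROUTE_LEGACY;
}

/* After: same condition, hint flipped to 1 (expected to be taken). */
static enum demo_route demo_route_after(int unified_enabled) {
    if (__builtin_expect(unified_enabled != 0, 1)) {
        return DEMO_ROUTE_UNIFIED;
    }
    return DEMO_ROUTE_LEGACY;
}

/* The hint only influences code layout and static prediction; both versions
 * must return the same route for every input. */
static void demo_check_hint_is_semantically_neutral(void) {
    for (int enabled = 0; enabled <= 1; ++enabled) {
        assert(demo_route_before(enabled) == demo_route_after(enabled));
    }
}
```

Since the hint never changes which branch is taken, any speedup from flipping it comes from code layout. It therefore has to be confirmed by benchmarking, as the phase log does.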
@@ -34,9 +35,10 @@ Phase 3 (mincore removal): 56.8 M ops/s
 Phase 4 (Hot/Cold Box): 57.2 M ops/s (+0.7%)
 Phase 5 (Mid MT fix): 52.3 M ops/s (-8.6% regression)
 Phase 6 (Lock-free Mid MT): 42.1 M ops/s (Mid MT: +2.65%)
-Phase 7 (Unified front): 80.6 M ops/s (+54.2%!) ⭐
+Phase 7-Step1 (Unified front): 80.6 M ops/s (+54.2%!) ⭐
+Phase 7-Step4 (Dead code): 81.5 M ops/s (+1.1%) ⭐⭐

-Total improvement: +41.9% (56.8M → 80.6M) from Phase 3
+Total improvement: +43.5% (56.8M → 81.5M) from Phase 3
 ```

 ### Benchmark Results Summary
@@ -46,6 +48,7 @@ Total improvement: +41.9% (56.8M → 80.6M) from Phase 3
 Phase 7-Step1 (branch hint): 80.6 M ops/s (+54.2%)
 Phase 7-Step2 (PGO mode): 80.3 M ops/s (-0.37%, noise)
 Phase 7-Step3 (config box): 80.6 M ops/s (+0.37%, noise)
+Phase 7-Step4 (macros): 81.5 M ops/s (+1.1%, dead code elimination!)
 ```

 **bench_mid_mt_gap (1KB-8KB, Mid MT workload, ws=256)**:
@@ -136,6 +139,53 @@ bench_random_mixed_hakmem.o: bench_random_mixed.c hakmem.h
 #define TINY_FRONT_ULTRA_SLIM_ENABLED ultra_slim_mode_enabled() // Runtime check
 ```

+### Phase 7-Step4 (Macro Replacement)
+
+**File**: `core/tiny_alloc_fast.inc.h`
+**Lines**: 421, 757, 809 (3 hot path checks)
+
+**Changes**:
+Replace runtime checks with config macros for dead code elimination:
+
+```c
+// Line 421: FastCache check
+// Before:
+if (__builtin_expect(g_fastcache_enable && class_idx <= 3, 1)) {
+// After:
+if (__builtin_expect(TINY_FRONT_FASTCACHE_ENABLED && class_idx <= 3, 1)) {
+
+// Line 809: Heap V2 check
+// Before:
+if (__builtin_expect(tiny_heap_v2_enabled() && front_prune_heapv2_enabled() && class_idx <= 3, 0)) {
+// After:
+if (__builtin_expect(TINY_FRONT_HEAP_V2_ENABLED && front_prune_heapv2_enabled() && class_idx <= 3, 0)) {
+
+// Line 757: Ultra SLIM check
+// Before:
+if (__builtin_expect(ultra_slim_mode_enabled(), 0)) {
+// After:
+if (__builtin_expect(TINY_FRONT_ULTRA_SLIM_ENABLED, 0)) {
+```
+
+**Effect**: Dead code elimination in PGO mode
+- PGO mode (`-DHAKMEM_TINY_FRONT_PGO=1`):
+  - `if (0 && ...) { ... }` → entire block removed by compiler
+  - Smaller code size, better instruction cache locality
+  - Fewer branches in hot path
+- Normal mode (default):
+  - `if (g_fastcache_enable && ...) { ... }` → runtime check preserved
+  - Full backward compatibility with ENV variables
+
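The two modes described above can be sketched as a small config header plus one call site. This is a minimal sketch under stated assumptions, not the project's actual config box: only `HAKMEM_TINY_FRONT_PGO`, `TINY_FRONT_FASTCACHE_ENABLED`, `g_fastcache_enable`, and the `class_idx <= 3` condition come from the diff. The `#if` arrangement and the `demo_*` functions are hypothetical.

```c
#include <assert.h>

/* Normal mode unless the build passes -DHAKMEM_TINY_FRONT_PGO=1. */
#ifndef HAKMEM_TINY_FRONT_PGO
#define HAKMEM_TINY_FRONT_PGO 0
#endif

/* Stand-in for the runtime flag (assumed to be driven by ENV in the real code). */
static int g_fastcache_enable = 1;

#if HAKMEM_TINY_FRONT_PGO
/* PGO mode: a literal 0 lets the optimizer fold `0 && ...` and drop the block. */
#define TINY_FRONT_FASTCACHE_ENABLED 0
#else
/* Normal mode: the macro expands to the runtime flag, so behavior is unchanged. */
#define TINY_FRONT_FASTCACHE_ENABLED g_fastcache_enable
#endif

/* Hypothetical allocation paths; the return values only identify the path. */
static int demo_fastcache_pop(int class_idx)  { return 100 + class_idx; }
static int demo_backend_alloc(int class_idx)  { return 200 + class_idx; }

static int demo_alloc(int class_idx) {
    assert(class_idx >= 0);
    if (__builtin_expect(TINY_FRONT_FASTCACHE_ENABLED && class_idx <= 3, 1)) {
        return demo_fastcache_pop(class_idx);   /* dead code in PGO mode */
    }
    return demo_backend_alloc(class_idx);
}
```

In a default build the call site still follows the runtime flag. When compiled with `-DHAKMEM_TINY_FRONT_PGO=1`, the condition is a compile-time constant and an optimizing compiler would normally remove the FastCache branch entirely.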
+**Performance Impact**:
+- Before: 80.6 M ops/s (Phase 7-Step3)
+- After: 81.0 / 81.0 / 82.4 M ops/s (3 runs)
+- Average: 81.5 M ops/s (+1.1%, +0.9 M ops/s)
+
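The averages and percentages above can be recomputed with two small helpers. The helper names are hypothetical, and the inputs are the throughputs as printed in this log, which are rounded to one decimal. Recomputed values can therefore differ slightly from the reported ones: the rounded endpoints 52.3 and 81.5 give roughly +55.8% rather than +55.5%, which presumably reflects unrounded measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n benchmark runs, in M ops/s. */
static double mean_mops(const double *runs, size_t n) {
    assert(runs != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += runs[i];
    }
    return sum / (double)n;
}

/* Relative change from `before` to `after`, in percent. */
static double pct_gain(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

For example, `mean_mops` over the three Step 4 runs (81.0, 81.0, 82.4) is about 81.47, and `pct_gain(80.6, 81.5)` is about 1.12, matching the reported 81.5 M ops/s and +1.1%.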
+**Dead Code Eliminated**:
+1. FastCache path (C0-C3): `fastcache_pop()` call + hit/miss tracking
+2. Heap V2 path: `tiny_heap_v2_alloc_by_class()` + metrics
+3. Ultra SLIM path: `ultra_slim_alloc_with_refill()` early return
+
 ---

 ## Next Phase Options (from Task Agent Plan)