2025-11-29 11:28:38 +09:00
|
|
|
# Current Task: Phase 4 - Tiny Front Optimization
|
|
|
|
|
|
|
|
|
|
**Date**: 2025-11-29
|
|
|
|
|
**Goal**: Tiny allocation throughput 2x improvement (56.8M → 110M+ ops/s)
|
|
|
|
|
**Strategy**: Box化 + PGO + Hot/Cold separation
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## Phase 4 Overview: 3-Step Approach
|
|
|
|
|
|
|
|
|
|
### Step 1: PGO Workflow Box ✅ COMPLETE (+6.25%)
|
|
|
|
|
- **Duration**: ~~1-2 days~~ **Completed: 2025-11-29**
|
|
|
|
|
- **Risk**: Low
|
|
|
|
|
- **Target**: 56.8M → 60-62M ops/s
|
|
|
|
|
- **Actual**: **57.0M → 60.6M ops/s (+6.25%)** ✓
|
|
|
|
|
|
|
|
|
|
**Deliverables**:
|
|
|
|
|
1. ✅ `scripts/box/pgo_tiny_profile_box.sh` - Profile collection automation
|
|
|
|
|
2. ✅ `scripts/box/pgo_tiny_profile_config.sh` - Workload configuration
|
|
|
|
|
3. ✅ Makefile targets: `pgo-tiny-profile`, `pgo-tiny-collect`, `pgo-tiny-build`, `pgo-tiny-full`
|
|
|
|
|
4. ✅ Makefile help target updated with PGO instructions
|
|
|
|
|
5. ✅ Benchmark comparison (before/after PGO)
|
|
|
|
|
6. ✅ Completion report: `PHASE4_STEP1_COMPLETE.md`
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
2025-11-29 12:00:27 +09:00
|
|
|
### Step 2: Hot/Cold Path Box ✅ COMPLETE (+7.3%)
|
|
|
|
|
- **Duration**: ~~3-5 days~~ **Completed: 2025-11-29**
|
2025-11-29 11:28:38 +09:00
|
|
|
- **Risk**: Medium
|
|
|
|
|
- **Target**: 60-62M → 68-75M ops/s (cumulative +15-25%)
|
2025-11-29 12:00:27 +09:00
|
|
|
- **Actual**: **53.3M → 57.2M ops/s (+7.3%, without PGO)** ✓
|
2025-11-29 11:28:38 +09:00
|
|
|
|
|
|
|
|
**Deliverables**:
|
2025-11-29 12:00:27 +09:00
|
|
|
1. ✅ `core/box/tiny_front_hot_box.h` - Ultra-fast path (1 branch, range check removed)
|
|
|
|
|
2. ✅ `core/box/tiny_front_cold_box.h` - Slow path (noinline, cold)
|
|
|
|
|
3. ✅ Refactored `malloc_tiny_fast()` to use Hot/Cold boxes
|
|
|
|
|
4. ⏸️ PGO re-optimization (temporarily disabled due to build issues)
|
|
|
|
|
5. ✅ Completion report: `PHASE4_STEP2_COMPLETE.md`
|
|
|
|
|
|
|
|
|
|
**Note**: PGO temporarily disabled (build issues). Expected +13-15% with PGO re-enabled.
|
2025-11-29 11:28:38 +09:00
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
2025-11-29 12:20:34 +09:00
|
|
|
### Step 3: Front Config Box ✅ COMPLETE (+2.7-4.9%)
|
|
|
|
|
- **Duration**: ~~2-3 days~~ **Completed: 2025-11-29**
|
2025-11-29 11:28:38 +09:00
|
|
|
- **Risk**: Low
|
|
|
|
|
- **Target**: 68-75M → 73-83M ops/s (cumulative +20-33%)
|
2025-11-29 12:20:34 +09:00
|
|
|
- **Actual**: **50.3M → 52.8M ops/s (+2.7-4.9%, limited scope)** ✓
|
2025-11-29 11:28:38 +09:00
|
|
|
|
|
|
|
|
**Deliverables**:
|
2025-11-29 12:20:34 +09:00
|
|
|
1. ✅ `core/box/tiny_front_config_box.h` - Compile-time config management
|
|
|
|
|
2. ✅ Replace runtime checks with `TINY_FRONT_*_ENABLED` macros (2 call sites)
|
|
|
|
|
3. ✅ Build flag: `HAKMEM_TINY_FRONT_PGO=1`
|
|
|
|
|
4. ⏸️ Final PGO optimization (PGO still disabled due to build issues)
|
|
|
|
|
5. ✅ Completion report: `PHASE4_STEP3_COMPLETE.md`
|
|
|
|
|
|
|
|
|
|
**Note**: Achieved +2.7-4.9% (below +5-8% target) due to limited scope (1 function, 2 call sites).
|
|
|
|
|
Full target achievable by expanding to all config functions (6+ remaining).
|
2025-11-29 11:28:38 +09:00
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## Success Criteria
|
|
|
|
|
|
|
|
|
|
**bench_random_mixed (ws=256)**:
|
|
|
|
|
- Phase 3 baseline: 56.8M ops/s
|
|
|
|
|
- Phase 4.1 (PGO): 60-62M ops/s
|
|
|
|
|
- Phase 4.2 (Hot/Cold): 68-75M ops/s
|
|
|
|
|
- Phase 4.3 (Config): **73-83M ops/s** ✓ (vs mimalloc 107M = 68-77%)
|
|
|
|
|
|
|
|
|
|
**bench_tiny_hot (64B)**:
|
|
|
|
|
- Phase 3 baseline: 81.0M ops/s
|
|
|
|
|
- Phase 4.3 target: **100-115M ops/s** ✓ (vs system 156M = 64-74%)
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
2025-11-29 12:20:34 +09:00
|
|
|
## Current Status: All 3 Steps Complete ✅ → Next: PGO Fix or Expand Config Box
|
2025-11-29 11:28:38 +09:00
|
|
|
|
2025-11-29 12:00:27 +09:00
|
|
|
**Completed (Step 1)**:
|
|
|
|
|
1. ✅ PGO Profile Collection Box implemented (+6.25% improvement with PGO)
|
2025-11-29 11:28:38 +09:00
|
|
|
2. ✅ Makefile workflow automation (`make pgo-tiny-full`)
|
|
|
|
|
3. ✅ Help target updated for discoverability
|
2025-11-29 12:00:27 +09:00
|
|
|
4. ✅ Completion report: `PHASE4_STEP1_COMPLETE.md`
|
|
|
|
|
|
|
|
|
|
**Completed (Step 2)**:
|
|
|
|
|
1. ✅ Tiny Front Hot Path Box (1 branch, range check removed)
|
|
|
|
|
2. ✅ Tiny Front Cold Path Box (noinline, cold attributes)
|
|
|
|
|
3. ✅ Refactored `malloc_tiny_fast()` with Hot/Cold separation
|
|
|
|
|
4. ✅ Benchmark: **+7.3% improvement** (53.3 → 57.2 M ops/s, without PGO)
|
|
|
|
|
5. ✅ Completion report: `PHASE4_STEP2_COMPLETE.md`
|
|
|
|
|
|
2025-11-29 12:20:34 +09:00
|
|
|
**Completed (Step 3)**:
|
|
|
|
|
1. ✅ Front Config Box (compile-time config, dead code elimination)
|
|
|
|
|
2. ✅ Build flag: `HAKMEM_TINY_FRONT_PGO=1`
|
|
|
|
|
3. ✅ Config macros: `TINY_FRONT_*_ENABLED` (2 call sites updated)
|
|
|
|
|
4. ✅ Benchmark: **+2.7-4.9% improvement** (50.3 → 52.8 M ops/s)
|
|
|
|
|
5. ✅ Completion report: `PHASE4_STEP3_COMPLETE.md`
|
|
|
|
|
|
2025-11-29 12:00:27 +09:00
|
|
|
**Next Actions (Choose One)**:
|
2025-11-29 12:20:34 +09:00
|
|
|
- **Option A: Expand Config Box** - Replace 6+ remaining config functions (+2-3% more expected)
|
|
|
|
|
- **Option B: Fix PGO** - Resolve build issues, re-enable PGO workflow (+6% expected from Step 1)
|
|
|
|
|
- **Option C: Mark Phase 4 Complete** - Move to next phase or final optimization
|
2025-11-29 11:28:38 +09:00
|
|
|
|
|
|
|
|
**Design Reference**: `docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md` (already complete)
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## Notes from ChatGPT Analysis
|
|
|
|
|
|
|
|
|
|
**Real bottleneck**:
|
|
|
|
|
- NOT front_gate_v2 alone
|
|
|
|
|
- BUT `tiny_alloc_fast()` overall complexity (15-20 branches)
|
|
|
|
|
|
|
|
|
|
**Branch explosion sources**:
|
|
|
|
|
1. ultra_slim_mode_enabled() gate
|
|
|
|
|
2. hak_tiny_size_to_class range check
|
|
|
|
|
3. tiny_sizeclass_hist_hit (profile)
|
|
|
|
|
4. HeapV2 enabled/disabled
|
|
|
|
|
5. FastCache enabled/disabled
|
|
|
|
|
6. SFC enabled/disabled + hit/miss
|
|
|
|
|
7. TLS SLL enabled/disabled + per-class branches
|
|
|
|
|
8. Multiple env gates in refill path
|
|
|
|
|
|
|
|
|
|
**Pool/Tiny boundary**: Negligible overhead (0.1-0.2% in bench)
|
|
|
|
|
|
|
|
|
|
**memset/page fault**: Already optimized (TRUST_MMAP_ZERO=1)
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
Updated: 2025-11-29
|
|
|
|
|
Phase: 4 (Tiny Front Optimization)
|
|
|
|
|
Previous: Phase 3 (mincore removal, +10.7%)
|