hakmem/CURRENT_TASK.md

# Current Task: Phase 4 - Tiny Front Optimization

**Date**: 2025-11-29
**Goal**: Double tiny-allocation throughput (56.8M → 110M+ ops/s)
**Strategy**: Boxification + PGO (profile-guided optimization) + Hot/Cold separation

---

## Phase 4 Overview: 3-Step Approach

### Step 1: PGO Workflow Box ✅ COMPLETE (+6.25%)
- **Duration**: ~~1-2 days~~ **Completed: 2025-11-29**
- **Risk**: Low
- **Target**: 56.8M → 60-62M ops/s
- **Actual**: **57.0M → 60.6M ops/s (+6.25%)** ✓

**Deliverables**:
1. ✅ `scripts/box/pgo_tiny_profile_box.sh` - Profile collection automation
2. ✅ `scripts/box/pgo_tiny_profile_config.sh` - Workload configuration
3. ✅ Makefile targets: `pgo-tiny-profile`, `pgo-tiny-collect`, `pgo-tiny-build`, `pgo-tiny-full`
4. ✅ Makefile help target updated with PGO instructions
5. ✅ Benchmark comparison (before/after PGO)
6. ✅ Completion report: `PHASE4_STEP1_COMPLETE.md`
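
PGO only pays off if the profiled workload is representative and repeatable. With GCC this usually means building once with `-fprofile-generate`, running the workloads to emit `.gcda` files, then rebuilding with `-fprofile-use`. The sketch below shows one way a deterministic random-mixed workload could be driven from a fixed seed. Every `sketch_` name, the xorshift generator, and the 16-128 byte size range are assumptions for illustration, not the contents of the scripts listed above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Small deterministic PRNG (xorshift32); the state must be non-zero. */
static uint32_t sketch_xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/*
 * Hypothetical random-mixed workload: `ws` live slots, each iteration
 * frees one random slot and reallocates it with a random tiny size.
 * Returns a checksum of the (slot, size) sequence so that two runs
 * with the same seed can be compared.
 */
static uint64_t sketch_random_mixed(uint32_t seed, size_t ws, size_t iters)
{
    assert(seed != 0 && ws > 0);

    void **slots = calloc(ws, sizeof *slots);
    if (slots == NULL)
        return 0;

    uint32_t rng = seed;
    uint64_t checksum = 0;
    for (size_t i = 0; i < iters; i++) {
        size_t idx  = sketch_xorshift32(&rng) % ws;
        size_t size = 16 + (sketch_xorshift32(&rng) % 113); /* 16..128 bytes */
        free(slots[idx]);              /* free(NULL) is a no-op */
        slots[idx] = malloc(size);
        checksum = checksum * 31 + size + idx;
    }

    for (size_t i = 0; i < ws; i++)
        free(slots[i]);
    free(slots);
    return checksum;
}
```

Calling `sketch_random_mixed(42, 256, 1000000)` in both the instrumented and the optimized binary would exercise the same allocation sequence, which keeps before/after comparisons fair.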

---

### Step 2: Hot/Cold Path Box (Expected: +10-15%)
- **Duration**: 3-5 days
- **Risk**: Medium
- **Target**: 60-62M → 68-75M ops/s (cumulative +15-25%)

**Deliverables**:
1. `core/box/tiny_front_hot_box.h` - Ultra-fast path (5-7 branches max)
2. `core/box/tiny_front_cold_box.h` - Slow path (noinline, cold)
3. Refactor `tiny_alloc_fast()` to use Hot/Cold boxes
4. PGO re-optimization with new structure
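
A minimal sketch of the hot/cold split, assuming a GCC or Clang toolchain. All `sketch_` names and constants are hypothetical, and the size-class mapping (16-byte steps up to 128 bytes) is an assumption rather than hakmem's actual layout. The point is only the shape: the hot function stays small and inlinable with two predictable branches, while refill work is moved into a `noinline, cold` function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_NUM_CLASSES  8   /* hypothetical: classes of 16B..128B */
#define SKETCH_REFILL_COUNT 16  /* hypothetical refill batch size */

/* Per-thread free list head for each size class (hypothetical layout). */
static _Thread_local void *sketch_tls_head[SKETCH_NUM_CLASSES];

/* Cold path: kept out of line so it does not bloat the hot path. */
static __attribute__((noinline, cold)) void *sketch_alloc_cold(int cls)
{
    size_t block_size = (size_t)(cls + 1) * 16;

    /* Carve one chunk into blocks; the chunk is never returned to the
     * system in this sketch. */
    char *chunk = malloc(block_size * SKETCH_REFILL_COUNT);
    if (chunk == NULL)
        return NULL;

    /* Push blocks 1..N-1 onto the free list; block 0 goes to the caller. */
    for (int i = 1; i < SKETCH_REFILL_COUNT; i++) {
        void *blk = chunk + (size_t)i * block_size;
        *(void **)blk = sketch_tls_head[cls];
        sketch_tls_head[cls] = blk;
    }
    return chunk;
}

/* Hot path: one range check and one hit check. */
static inline void *sketch_alloc_hot(size_t size)
{
    if (__builtin_expect(size == 0 || size > 128, 0))
        return NULL;  /* outside the tiny range; real code would route elsewhere */

    int cls = (int)((size - 1) >> 4);
    void *blk = sketch_tls_head[cls];
    if (__builtin_expect(blk != NULL, 1)) {
        sketch_tls_head[cls] = *(void **)blk;  /* pop */
        return blk;
    }
    return sketch_alloc_cold(cls);
}

/* Free: push the block back onto its class list. */
static inline void sketch_free(void *p, size_t size)
{
    assert(p != NULL && size >= 1 && size <= 128);
    int cls = (int)((size - 1) >> 4);
    *(void **)p = sketch_tls_head[cls];
    sketch_tls_head[cls] = p;
}
```

Because the cold function is marked `cold`, the compiler may place it in a separate text section and treat calls to it as unlikely, which tends to help instruction-cache density on the hot path. Whether that yields the expected +10-15% here has to be confirmed by the benchmarks.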

---

### Step 3: Front Config Box (Expected: +5-8%)
- **Duration**: 2-3 days
- **Risk**: Low
- **Target**: 68-75M → 73-83M ops/s (cumulative +20-33%)

**Deliverables**:
1. `core/box/tiny_front_config_box.h` - Compile-time config management
2. Replace runtime checks with `TINY_FRONT_*_ENABLED` macros
3. Build flag: `HAKMEM_TINY_FRONT_PGO=1`
4. Final PGO optimization + full benchmark suite
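
One possible shape for the compile-time config, sketched under assumptions: `TINY_FRONT_FASTCACHE_ENABLED` is a hypothetical instance of the `TINY_FRONT_*_ENABLED` pattern, the environment variable name is invented, and the way `HAKMEM_TINY_FRONT_PGO` selects between the two definitions is a guess at the intent, not the actual design. With the flag set, the macro becomes a constant and the compiler can delete the branch; without it, the same call site falls back to a cached runtime check.

```c
#include <assert.h>
#include <stdlib.h>

/* Build flag named in the deliverables; defaulting to 0 is an assumption. */
#ifndef HAKMEM_TINY_FRONT_PGO
#define HAKMEM_TINY_FRONT_PGO 0
#endif

/* Runtime gate: "0" disables, anything else enables, unset uses the default. */
static int sketch_env_gate(const char *name, int default_on)
{
    const char *v = getenv(name);
    if (v == NULL)
        return default_on;
    return v[0] != '0';
}

/* Reads the (hypothetical) environment variable once and caches the result. */
static int sketch_fastcache_runtime(void)
{
    static int cached = -1;
    if (cached < 0)
        cached = sketch_env_gate("SKETCH_TINY_FASTCACHE", 1);
    return cached;
}

#if HAKMEM_TINY_FRONT_PGO
/* Fixed at compile time: the `if` below folds away entirely. */
#define TINY_FRONT_FASTCACHE_ENABLED 1
#else
/* Development builds keep the runtime switch. */
#define TINY_FRONT_FASTCACHE_ENABLED sketch_fastcache_runtime()
#endif

/* Call sites look identical in both modes. Returns 1 for the fast-cache
 * path and 0 for the fallback path. */
static int sketch_pick_path(void)
{
    if (TINY_FRONT_FASTCACHE_ENABLED)
        return 1;
    return 0;
}
```

Keeping the macro name identical in both modes means call sites do not change between debug and release builds; only the build flag does.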

---

## Success Criteria

**bench_random_mixed (ws=256)**:
- Phase 3 baseline: 56.8M ops/s
- Phase 4.1 (PGO): 60-62M ops/s
- Phase 4.2 (Hot/Cold): 68-75M ops/s
- Phase 4.3 (Config): **73-83M ops/s** ✓ (vs mimalloc 107M = 68-77%)

**bench_tiny_hot (64B)**:
- Phase 3 baseline: 81.0M ops/s
- Phase 4.3 target: **100-115M ops/s** ✓ (vs system 156M = 64-74%)

---

## Current Status: Step 1 Complete ✅ → Ready for Step 2

**Completed**:
1. ✅ PGO Profile Collection Box implemented (+6.25% improvement)
2. ✅ Makefile workflow automation (`make pgo-tiny-full`)
3. ✅ Help target updated for discoverability
4. ✅ Completion report written

**Next Actions (Step 2)**:
1. Implement Tiny Front Hot Path Box (5-7 branches max)
2. Implement Tiny Front Cold Path Box (noinline, cold)
3. Refactor `tiny_alloc_fast()` to use Hot/Cold separation
4. Re-run PGO optimization with new structure
5. Benchmark: Target 68-75M ops/s (+10-15% over Step 1)

**Design Reference**: `docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md` (already complete)

---

## Notes from ChatGPT Analysis

**Real bottleneck**:
- Not `front_gate_v2` alone
- Rather the overall complexity of `tiny_alloc_fast()` (15-20 branches)

**Branch explosion sources**:
1. `ultra_slim_mode_enabled()` gate
2. `hak_tiny_size_to_class` range check
3. `tiny_sizeclass_hist_hit` (profile)
4. HeapV2 enabled/disabled
5. FastCache enabled/disabled
6. SFC enabled/disabled + hit/miss
7. TLS SLL enabled/disabled + per-class branches
8. Multiple env gates in refill path

**Pool/Tiny boundary**: Negligible overhead (0.1-0.2% in bench)

**memset/page fault**: Already optimized (TRUST_MMAP_ZERO=1)

---

Updated: 2025-11-29
Phase: 4 (Tiny Front Optimization)
Previous: Phase 3 (mincore removal, +10.7%)
---

**Commit note** (2025-11-29 11:28:38 +09:00): Phase 4-Step1: Add PGO workflow automation (+6.25% performance)

Implemented an automated Profile-Guided Optimization workflow using the Box pattern.

**Performance Improvement**:
- Baseline: 57.0M ops/s
- PGO-optimized: 60.6M ops/s
- Gain: +6.25% (within the expected +5-10% range)

**Implementation**:
1. `scripts/box/pgo_tiny_profile_config.sh` - 5 representative workloads
2. `scripts/box/pgo_tiny_profile_box.sh` - Automated profile collection
3. Makefile PGO targets:
   - `pgo-tiny-profile`: Build instrumented binaries
   - `pgo-tiny-collect`: Collect `.gcda` profile data
   - `pgo-tiny-build`: Build optimized binaries
   - `pgo-tiny-full`: Complete workflow (profile → collect → build → test)
4. Makefile help target: Added PGO instructions for discoverability

**Design**:
- Boxification: Single responsibility, clear contracts
- Deterministic: Fixed seeds (42) for reproducibility
- Safe: Validation, error detection, timeout protection (30s/workload)
- Observable: Progress reporting, `.gcda` verification (33 files generated)

**Workload Coverage**:
- Random mixed: 3 working set sizes (128/256/512 slots)
- Tiny hot: 2 size classes (16B/64B)
- Total: 5 workloads covering hot/cold paths

**Documentation**:
- `PHASE4_STEP1_COMPLETE.md` - Completion report
- `CURRENT_TASK.md` - Phase 4 roadmap (Step 1 complete ✓)
- `docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md` - Complete Phase 4 design

**Next**: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
