Files
hakmem/CURRENT_TASK.md

362 lines
12 KiB
Markdown
Raw Normal View History

# Current Task: Phase 8 Complete - BenchFast Root Cause Fixes
**Date**: 2025-11-30
**Status**: Phase 8 ✅ COMPLETE (Root Cause Fixes)
**Achievement**: BenchFast crash investigation and fixes (TLS→Atomic + Header write)
---
## Phase 8 Complete! ✅
**Result**: BenchFast crash root cause investigation and fixes **COMPLETE**
**Performance**: 16.3M ops/s (normal mode, working)
**Duration**: 1 day (investigation + fixes)
**Completed Steps**:
- ✅ Layer 0: Limited prealloc to actual TLS SLL capacity (50,000 → 128 blocks/class)
- ✅ Layer 1: Removed unnecessary unified_cache_init() call (design misunderstanding)
- ✅ Layer 2: Infrastructure isolation (__libc_calloc for Unified Cache)
- ✅ Layer 3: Box Contract documentation (BenchFast uses TLS SLL, not UC)
- ✅ TLS→Atomic: Fixed cross-thread guard variable (pthread_once bug)
- ✅ Header Write: Direct write to bypass P3 optimization (free routing bug)
**Key Discoveries** (箱理論 Root Cause Analysis):
1. **Design Misunderstanding** (Layer 1): BenchFast uses TLS SLL directly, NOT Unified Cache
- unified_cache_init() created 16KB mmap allocations
- Later freed via BenchFast → header misclassification → CRASH
2. **TLS Scope Bug** (Atomic Fix): `__thread int` doesn't work across threads
- pthread_once() creates new thread with fresh TLS (= 0)
- Guard broken → getenv() allocates via BenchFast → freed by __libc_free() → CRASH
3. **P3 Optimization Bug** (Header Fix): tiny_region_id_write_header() skips writes by default
- BenchFast free routing requires 0xa0-0xa7 magic header
- No header → __libc_free() tries to free HAKMEM pointer → CRASH
**箱理論 Validation**:
```
Single Responsibility: ✅ Guard protects entire process (not per-thread)
Clear Contract: ✅ BenchFast always writes headers (explicit)
Observable: ✅ Atomic variable visible across all threads
Composable: ✅ Works with pthread_once() and any threading model
```
---
## Commits
### Phase 8 Root Cause Fix
**Commit**: `191e65983`
**Date**: 2025-11-30
**Files**: 3 files, 36 insertions(+), 13 deletions(-)
**Changes**:
1. `bench_fast_box.c` (Layer 0 + Layer 1):
- Removed unified_cache_init() call (design misunderstanding)
- Limited prealloc to 128 blocks/class (actual TLS SLL capacity)
- Added root cause comments explaining why unified_cache_init() was wrong
2. `bench_fast_box.h` (Layer 3):
- Added Box Contract documentation (BenchFast uses TLS SLL, NOT UC)
- Documented scope separation (workload vs infrastructure allocations)
- Added contract violation example (Phase 8 bug explanation)
3. `tiny_unified_cache.c` (Layer 2):
- Changed calloc() → __libc_calloc() (infrastructure isolation)
- Changed free() → __libc_free() (symmetric cleanup)
- Added defensive fix comments explaining infrastructure bypass
### Phase 8-TLS-Fix
**Commit**: `da8f4d2c8`
**Date**: 2025-11-30
**Files**: 3 files, 21 insertions(+), 11 deletions(-)
**Changes**:
1. `bench_fast_box.c` (TLS→Atomic):
- Changed `__thread int bench_fast_init_in_progress``atomic_int g_bench_fast_init_in_progress`
- Added atomic_load() for reads, atomic_store() for writes
- Added root cause comments (pthread_once creates fresh TLS)
2. `bench_fast_box.h` (TLS→Atomic):
- Updated extern declaration to match atomic_int
- Added Phase 8-TLS-Fix comment explaining cross-thread safety
3. `bench_fast_box.c` (Header Write):
- Replaced `tiny_region_id_write_header()` → direct write `*(uint8_t*)base = 0xa0 | class_idx`
- Added Phase 8-P3-Fix comment explaining P3 optimization bypass
- Contract: BenchFast always writes headers (required for free routing)
4. `hak_wrappers.inc.h` (Atomic):
- Updated bench_fast_init_in_progress check to use atomic_load()
- Added Phase 8-TLS-Fix comment for cross-thread safety
---
## Performance Journey
### Phase-by-Phase Progress
```
Phase 3 (mincore removal): 56.8 M ops/s
Phase 4 (Hot/Cold Box): 57.2 M ops/s (+0.7%)
Phase 5 (Mid MT fix): 52.3 M ops/s (-8.6% regression)
Phase 6 (Lock-free Mid MT): 42.1 M ops/s (Mid MT: +2.65%)
Phase 7-Step1 (Unified front): 80.6 M ops/s (+54.2%!) ⭐
Phase 7-Step4 (Dead code): 81.5 M ops/s (+1.1%) ⭐⭐
Phase 8 (Normal mode): 16.3 M ops/s (working, different workload)
Total improvement: +43.5% (56.8M → 81.5M) from Phase 3
```
**Note**: Phase 8 used different benchmark (10M iterations, ws=8192) vs Phase 7 (ws=256).
Normal mode performance: 16.3M ops/s (working, no crash).
---
## Technical Details
### Layer 0: Prealloc Capacity Fix
**File**: `core/box/bench_fast_box.c`
**Lines**: 131-148
**Root Cause**:
- Old code preallocated 50,000 blocks/class
- TLS SLL actual capacity: 128 blocks (adaptive sizing limit)
- Lost blocks (beyond 128) caused heap corruption
**Fix**:
```c
// Before:
const uint32_t PREALLOC_COUNT = 50000; // Too large!
// After:
const uint32_t ACTUAL_TLS_SLL_CAPACITY = 128; // Observed actual capacity
for (int cls = 2; cls <= 7; cls++) {
uint32_t capacity = ACTUAL_TLS_SLL_CAPACITY;
for (int i = 0; i < (int)capacity; i++) {
// preallocate...
}
}
```
### Layer 1: Design Misunderstanding Fix
**File**: `core/box/bench_fast_box.c`
**Lines**: 123-128 (REMOVED)
**Root Cause**:
- BenchFast uses TLS SLL directly (g_tls_sll[])
- Unified Cache is NOT used by BenchFast
- unified_cache_init() created 16KB allocations (infrastructure)
- Later freed by BenchFast → header misclassification → CRASH
**Fix**:
```c
// REMOVED:
// unified_cache_init(); // WRONG! BenchFast uses TLS SLL, not Unified Cache
// Added comment:
// Phase 8 Root Cause Fix: REMOVED unified_cache_init() call
// Reason: BenchFast uses TLS SLL directly, NOT Unified Cache
```
### Layer 2: Infrastructure Isolation
**File**: `core/front/tiny_unified_cache.c`
**Lines**: 61-71 (init), 103-109 (shutdown)
**Strategy**: Dual-Path Separation
- **Workload allocations** (measured): HAKMEM paths (TLS SLL, Unified Cache)
- **Infrastructure allocations** (unmeasured): __libc_calloc/__libc_free
**Fix**:
```c
// Before:
g_unified_cache[cls].slots = (void**)calloc(cap, sizeof(void*));
// After:
extern void* __libc_calloc(size_t, size_t);
g_unified_cache[cls].slots = (void**)__libc_calloc(cap, sizeof(void*));
```
### Layer 3: Box Contract Documentation
**File**: `core/box/bench_fast_box.h`
**Lines**: 13-51
**Added Documentation**:
- BenchFast uses TLS SLL, NOT Unified Cache
- Scope separation (workload vs infrastructure)
- Preconditions and guarantees
- Contract violation example (Phase 8 bug)
### TLS→Atomic Fix
**File**: `core/box/bench_fast_box.c`
**Lines**: 22-27 (declaration), 37, 124, 215 (usage)
**Root Cause**:
```
pthread_once() → creates new thread
New thread has fresh TLS (bench_fast_init_in_progress = 0)
Guard broken → getenv() allocates → freed by __libc_free() → CRASH
```
**Fix**:
```c
// Before (TLS - broken):
__thread int bench_fast_init_in_progress = 0;
if (__builtin_expect(bench_fast_init_in_progress, 0)) { ... }
// After (Atomic - fixed):
atomic_int g_bench_fast_init_in_progress = 0;
if (__builtin_expect(atomic_load(&g_bench_fast_init_in_progress), 0)) { ... }
```
**箱理論 Validation**:
- **Responsibility**: Guard must protect entire process (not per-thread)
- **Contract**: "No BenchFast allocations during init" (all threads)
- **Observable**: Atomic variable visible across all threads
- **Composable**: Works with pthread_once() threading model
### Header Write Fix
**File**: `core/box/bench_fast_box.c`
**Lines**: 70-80
**Root Cause**:
- P3 optimization: tiny_region_id_write_header() skips header writes by default
- BenchFast free routing checks header magic (0xa0-0xa7)
- No header → free() misroutes to __libc_free() → CRASH
**Fix**:
```c
// Before (broken - calls function that skips write):
tiny_region_id_write_header(base, class_idx);
return (void*)((char*)base + 1);
// After (fixed - direct write):
*(uint8_t*)base = (uint8_t)(0xa0 | (class_idx & 0x0f)); // Direct write
return (void*)((char*)base + 1);
```
**Contract**: BenchFast always writes headers (required for free routing)
---
## Next Phase Options
Phase 4-Step1: Add PGO workflow automation (+6.25% performance) Implemented automated Profile-Guided Optimization workflow using Box pattern: Performance Improvement: - Baseline: 57.0 M ops/s - PGO-optimized: 60.6 M ops/s - Gain: +6.25% (within expected +5-10% range) Implementation: 1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads 2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection 3. Makefile PGO targets: - pgo-tiny-profile: Build instrumented binaries - pgo-tiny-collect: Collect .gcda profile data - pgo-tiny-build: Build optimized binaries - pgo-tiny-full: Complete workflow (profile → collect → build → test) 4. Makefile help target: Added PGO instructions for discoverability Design: - Box化: Single responsibility, clear contracts - Deterministic: Fixed seeds (42) for reproducibility - Safe: Validation, error detection, timeout protection (30s/workload) - Observable: Progress reporting, .gcda verification (33 files generated) Workload Coverage: - Random mixed: 3 working set sizes (128/256/512 slots) - Tiny hot: 2 size classes (16B/64B) - Total: 5 workloads covering hot/cold paths Documentation: - PHASE4_STEP1_COMPLETE.md - Completion report - CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓) - docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00
### Option A: Continue Phase 7 (Steps 5-7) 📦
**Goal**: Remove remaining legacy layers (complete dead code elimination)
**Expected**: Additional +3-5% via further code cleanup
**Duration**: 1-2 days
**Risk**: Low (infrastructure already in place)
Phase 4-Step1: Add PGO workflow automation (+6.25% performance) Implemented automated Profile-Guided Optimization workflow using Box pattern: Performance Improvement: - Baseline: 57.0 M ops/s - PGO-optimized: 60.6 M ops/s - Gain: +6.25% (within expected +5-10% range) Implementation: 1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads 2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection 3. Makefile PGO targets: - pgo-tiny-profile: Build instrumented binaries - pgo-tiny-collect: Collect .gcda profile data - pgo-tiny-build: Build optimized binaries - pgo-tiny-full: Complete workflow (profile → collect → build → test) 4. Makefile help target: Added PGO instructions for discoverability Design: - Box化: Single responsibility, clear contracts - Deterministic: Fixed seeds (42) for reproducibility - Safe: Validation, error detection, timeout protection (30s/workload) - Observable: Progress reporting, .gcda verification (33 files generated) Workload Coverage: - Random mixed: 3 working set sizes (128/256/512 slots) - Tiny hot: 2 size classes (16B/64B) - Total: 5 workloads covering hot/cold paths Documentation: - PHASE4_STEP1_COMPLETE.md - Completion report - CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓) - docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00
**Remaining Steps**:
- Step 5: Compile library with PGO flag (Makefile change)
- Step 6: Verify dead code elimination in assembly
- Step 7: Measure performance improvement
Phase 4-Step1: Add PGO workflow automation (+6.25% performance) Implemented automated Profile-Guided Optimization workflow using Box pattern: Performance Improvement: - Baseline: 57.0 M ops/s - PGO-optimized: 60.6 M ops/s - Gain: +6.25% (within expected +5-10% range) Implementation: 1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads 2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection 3. Makefile PGO targets: - pgo-tiny-profile: Build instrumented binaries - pgo-tiny-collect: Collect .gcda profile data - pgo-tiny-build: Build optimized binaries - pgo-tiny-full: Complete workflow (profile → collect → build → test) 4. Makefile help target: Added PGO instructions for discoverability Design: - Box化: Single responsibility, clear contracts - Deterministic: Fixed seeds (42) for reproducibility - Safe: Validation, error detection, timeout protection (30s/workload) - Observable: Progress reporting, .gcda verification (33 files generated) Workload Coverage: - Random mixed: 3 working set sizes (128/256/512 slots) - Tiny hot: 2 size classes (16B/64B) - Total: 5 workloads covering hot/cold paths Documentation: - PHASE4_STEP1_COMPLETE.md - Completion report - CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓) - docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00
### Option B: PGO Re-enablement 🚀
**Goal**: Re-enable PGO workflow from Phase 4-Step1
**Expected**: +6-13% cumulative (on top of 81.5M)
**Duration**: 2-3 days
**Risk**: Low (proven pattern)
Phase 4-Step1: Add PGO workflow automation (+6.25% performance) Implemented automated Profile-Guided Optimization workflow using Box pattern: Performance Improvement: - Baseline: 57.0 M ops/s - PGO-optimized: 60.6 M ops/s - Gain: +6.25% (within expected +5-10% range) Implementation: 1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads 2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection 3. Makefile PGO targets: - pgo-tiny-profile: Build instrumented binaries - pgo-tiny-collect: Collect .gcda profile data - pgo-tiny-build: Build optimized binaries - pgo-tiny-full: Complete workflow (profile → collect → build → test) 4. Makefile help target: Added PGO instructions for discoverability Design: - Box化: Single responsibility, clear contracts - Deterministic: Fixed seeds (42) for reproducibility - Safe: Validation, error detection, timeout protection (30s/workload) - Observable: Progress reporting, .gcda verification (33 files generated) Workload Coverage: - Random mixed: 3 working set sizes (128/256/512 slots) - Tiny hot: 2 size classes (16B/64B) - Total: 5 workloads covering hot/cold paths Documentation: - PHASE4_STEP1_COMPLETE.md - Completion report - CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓) - docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00
**Current projection**:
- Phase 7 baseline: 81.5 M ops/s
- With PGO: ~86-93 M ops/s (+6-13%)
### Option C: BenchFast Pool Expansion 🏎️
**Goal**: Increase BenchFast pool size for full 10M iteration support
**Expected**: Structural ceiling measurement (30-40M ops/s target)
**Duration**: 1 day
**Risk**: Low (just increase prealloc count)
**Current status**:
- Pool: 128 blocks/class (768 total)
- Exhaustion: C6/C7 exhaust after ~200 iterations
- Need: ~10,000 blocks/class for 10M iterations (60,000 total)
Phase 4-Step1: Add PGO workflow automation (+6.25% performance) Implemented automated Profile-Guided Optimization workflow using Box pattern: Performance Improvement: - Baseline: 57.0 M ops/s - PGO-optimized: 60.6 M ops/s - Gain: +6.25% (within expected +5-10% range) Implementation: 1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads 2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection 3. Makefile PGO targets: - pgo-tiny-profile: Build instrumented binaries - pgo-tiny-collect: Collect .gcda profile data - pgo-tiny-build: Build optimized binaries - pgo-tiny-full: Complete workflow (profile → collect → build → test) 4. Makefile help target: Added PGO instructions for discoverability Design: - Box化: Single responsibility, clear contracts - Deterministic: Fixed seeds (42) for reproducibility - Safe: Validation, error detection, timeout protection (30s/workload) - Observable: Progress reporting, .gcda verification (33 files generated) Workload Coverage: - Random mixed: 3 working set sizes (128/256/512 slots) - Tiny hot: 2 size classes (16B/64B) - Total: 5 workloads covering hot/cold paths Documentation: - PHASE4_STEP1_COMPLETE.md - Completion report - CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓) - docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00
### Option D: Production Readiness 📊
**Goal**: Comprehensive benchmark suite, deployment guide
**Expected**: Full performance comparison, stability testing
**Duration**: 3-5 days
**Risk**: Low (documentation + testing)
Phase 4-Step1: Add PGO workflow automation (+6.25% performance) Implemented automated Profile-Guided Optimization workflow using Box pattern: Performance Improvement: - Baseline: 57.0 M ops/s - PGO-optimized: 60.6 M ops/s - Gain: +6.25% (within expected +5-10% range) Implementation: 1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads 2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection 3. Makefile PGO targets: - pgo-tiny-profile: Build instrumented binaries - pgo-tiny-collect: Collect .gcda profile data - pgo-tiny-build: Build optimized binaries - pgo-tiny-full: Complete workflow (profile → collect → build → test) 4. Makefile help target: Added PGO instructions for discoverability Design: - Box化: Single responsibility, clear contracts - Deterministic: Fixed seeds (42) for reproducibility - Safe: Validation, error detection, timeout protection (30s/workload) - Observable: Progress reporting, .gcda verification (33 files generated) Workload Coverage: - Random mixed: 3 working set sizes (128/256/512 slots) - Tiny hot: 2 size classes (16B/64B) - Total: 5 workloads covering hot/cold paths Documentation: - PHASE4_STEP1_COMPLETE.md - Completion report - CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓) - docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00
---
## Recommendation
### Top Pick: **Option C (BenchFast Pool Expansion)** 🏎️
Phase 4-Step1: Add PGO workflow automation (+6.25% performance) Implemented automated Profile-Guided Optimization workflow using Box pattern: Performance Improvement: - Baseline: 57.0 M ops/s - PGO-optimized: 60.6 M ops/s - Gain: +6.25% (within expected +5-10% range) Implementation: 1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads 2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection 3. Makefile PGO targets: - pgo-tiny-profile: Build instrumented binaries - pgo-tiny-collect: Collect .gcda profile data - pgo-tiny-build: Build optimized binaries - pgo-tiny-full: Complete workflow (profile → collect → build → test) 4. Makefile help target: Added PGO instructions for discoverability Design: - Box化: Single responsibility, clear contracts - Deterministic: Fixed seeds (42) for reproducibility - Safe: Validation, error detection, timeout protection (30s/workload) - Observable: Progress reporting, .gcda verification (33 files generated) Workload Coverage: - Random mixed: 3 working set sizes (128/256/512 slots) - Tiny hot: 2 size classes (16B/64B) - Total: 5 workloads covering hot/cold paths Documentation: - PHASE4_STEP1_COMPLETE.md - Completion report - CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓) - docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00
**Reasoning**:
1. **Phase 8 fixes working**: TLS→Atomic + Header write proven
2. **Quick win**: Just increase ACTUAL_TLS_SLL_CAPACITY to 10,000
3. **Scientific value**: Measure true structural ceiling (no safety costs)
4. **Low risk**: 1-day task, no code changes (just capacity tuning)
5. **Data-driven**: Enables comparison vs normal mode (16.3M vs 30-40M expected)
**Expected Result**:
```
Normal mode: 16.3 M ops/s (current)
BenchFast mode: 30-40 M ops/s (target, 2-2.5x faster)
```
Phase 4-Step1: Add PGO workflow automation (+6.25% performance) Implemented automated Profile-Guided Optimization workflow using Box pattern: Performance Improvement: - Baseline: 57.0 M ops/s - PGO-optimized: 60.6 M ops/s - Gain: +6.25% (within expected +5-10% range) Implementation: 1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads 2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection 3. Makefile PGO targets: - pgo-tiny-profile: Build instrumented binaries - pgo-tiny-collect: Collect .gcda profile data - pgo-tiny-build: Build optimized binaries - pgo-tiny-full: Complete workflow (profile → collect → build → test) 4. Makefile help target: Added PGO instructions for discoverability Design: - Box化: Single responsibility, clear contracts - Deterministic: Fixed seeds (42) for reproducibility - Safe: Validation, error detection, timeout protection (30s/workload) - Observable: Progress reporting, .gcda verification (33 files generated) Workload Coverage: - Random mixed: 3 working set sizes (128/256/512 slots) - Tiny hot: 2 size classes (16B/64B) - Total: 5 workloads covering hot/cold paths Documentation: - PHASE4_STEP1_COMPLETE.md - Completion report - CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓) - docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00
**Implementation**:
```c
// core/box/bench_fast_box.c:140
const uint32_t ACTUAL_TLS_SLL_CAPACITY = 10000; // Was 128
```
Phase 4-Step1: Add PGO workflow automation (+6.25% performance) Implemented automated Profile-Guided Optimization workflow using Box pattern: Performance Improvement: - Baseline: 57.0 M ops/s - PGO-optimized: 60.6 M ops/s - Gain: +6.25% (within expected +5-10% range) Implementation: 1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads 2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection 3. Makefile PGO targets: - pgo-tiny-profile: Build instrumented binaries - pgo-tiny-collect: Collect .gcda profile data - pgo-tiny-build: Build optimized binaries - pgo-tiny-full: Complete workflow (profile → collect → build → test) 4. Makefile help target: Added PGO instructions for discoverability Design: - Box化: Single responsibility, clear contracts - Deterministic: Fixed seeds (42) for reproducibility - Safe: Validation, error detection, timeout protection (30s/workload) - Observable: Progress reporting, .gcda verification (33 files generated) Workload Coverage: - Random mixed: 3 working set sizes (128/256/512 slots) - Tiny hot: 2 size classes (16B/64B) - Total: 5 workloads covering hot/cold paths Documentation: - PHASE4_STEP1_COMPLETE.md - Completion report - CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓) - docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00
---
### Second Choice: **Option B (PGO Re-enablement)** 🚀
Phase 4-Step1: Add PGO workflow automation (+6.25% performance) Implemented automated Profile-Guided Optimization workflow using Box pattern: Performance Improvement: - Baseline: 57.0 M ops/s - PGO-optimized: 60.6 M ops/s - Gain: +6.25% (within expected +5-10% range) Implementation: 1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads 2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection 3. Makefile PGO targets: - pgo-tiny-profile: Build instrumented binaries - pgo-tiny-collect: Collect .gcda profile data - pgo-tiny-build: Build optimized binaries - pgo-tiny-full: Complete workflow (profile → collect → build → test) 4. Makefile help target: Added PGO instructions for discoverability Design: - Box化: Single responsibility, clear contracts - Deterministic: Fixed seeds (42) for reproducibility - Safe: Validation, error detection, timeout protection (30s/workload) - Observable: Progress reporting, .gcda verification (33 files generated) Workload Coverage: - Random mixed: 3 working set sizes (128/256/512 slots) - Tiny hot: 2 size classes (16B/64B) - Total: 5 workloads covering hot/cold paths Documentation: - PHASE4_STEP1_COMPLETE.md - Completion report - CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓) - docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00
**Reasoning**:
1. **Proven benefit**: +6.25% in Phase 4-Step1
2. **Cumulative**: Would stack with Phase 7 (81.5M baseline)
3. **Low risk**: Just fix build issue
4. **High impact**: ~86-93 M ops/s projected
Phase 4-Step1: Add PGO workflow automation (+6.25% performance) Implemented automated Profile-Guided Optimization workflow using Box pattern: Performance Improvement: - Baseline: 57.0 M ops/s - PGO-optimized: 60.6 M ops/s - Gain: +6.25% (within expected +5-10% range) Implementation: 1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads 2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection 3. Makefile PGO targets: - pgo-tiny-profile: Build instrumented binaries - pgo-tiny-collect: Collect .gcda profile data - pgo-tiny-build: Build optimized binaries - pgo-tiny-full: Complete workflow (profile → collect → build → test) 4. Makefile help target: Added PGO instructions for discoverability Design: - Box化: Single responsibility, clear contracts - Deterministic: Fixed seeds (42) for reproducibility - Safe: Validation, error detection, timeout protection (30s/workload) - Observable: Progress reporting, .gcda verification (33 files generated) Workload Coverage: - Random mixed: 3 working set sizes (128/256/512 slots) - Tiny hot: 2 size classes (16B/64B) - Total: 5 workloads covering hot/cold paths Documentation: - PHASE4_STEP1_COMPLETE.md - Completion report - CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓) - docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00
---
## Current Performance Summary
### bench_random_mixed (16B-1KB, Tiny workload)
```
Phase 7-Step4 (ws=256): 81.5 M ops/s (+55.5% total)
Phase 8 (ws=8192): 16.3 M ops/s (normal mode, working)
```
### bench_mid_mt_gap (1KB-8KB, Mid MT workload, ws=256)
```
After Phase 6-B (lock-free): 42.09 M ops/s (+2.65%)
vs System malloc: 26.8 M ops/s (1.57x faster)
```
### Overall Status
-**Tiny allocations** (16B-1KB): **81.5 M ops/s** (excellent, +55.5%!)
-**Mid MT allocations** (1KB-8KB): 42 M ops/s (excellent, 1.57x vs system)
-**BenchFast mode**: No crash (TLS→Atomic + Header fix working)
- ⏸️ **Large allocations** (32KB-2MB): Not benchmarked yet
- ⏸️ **MT workloads**: No MT benchmarks yet
Phase 4-Step1: Add PGO workflow automation (+6.25% performance) Implemented automated Profile-Guided Optimization workflow using Box pattern: Performance Improvement: - Baseline: 57.0 M ops/s - PGO-optimized: 60.6 M ops/s - Gain: +6.25% (within expected +5-10% range) Implementation: 1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads 2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection 3. Makefile PGO targets: - pgo-tiny-profile: Build instrumented binaries - pgo-tiny-collect: Collect .gcda profile data - pgo-tiny-build: Build optimized binaries - pgo-tiny-full: Complete workflow (profile → collect → build → test) 4. Makefile help target: Added PGO instructions for discoverability Design: - Box化: Single responsibility, clear contracts - Deterministic: Fixed seeds (42) for reproducibility - Safe: Validation, error detection, timeout protection (30s/workload) - Observable: Progress reporting, .gcda verification (33 files generated) Workload Coverage: - Random mixed: 3 working set sizes (128/256/512 slots) - Tiny hot: 2 size classes (16B/64B) - Total: 5 workloads covering hot/cold paths Documentation: - PHASE4_STEP1_COMPLETE.md - Completion report - CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓) - docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00
---
## Decision Time
Phase 4-Step1: Add PGO workflow automation (+6.25% performance) Implemented automated Profile-Guided Optimization workflow using Box pattern: Performance Improvement: - Baseline: 57.0 M ops/s - PGO-optimized: 60.6 M ops/s - Gain: +6.25% (within expected +5-10% range) Implementation: 1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads 2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection 3. Makefile PGO targets: - pgo-tiny-profile: Build instrumented binaries - pgo-tiny-collect: Collect .gcda profile data - pgo-tiny-build: Build optimized binaries - pgo-tiny-full: Complete workflow (profile → collect → build → test) 4. Makefile help target: Added PGO instructions for discoverability Design: - Box化: Single responsibility, clear contracts - Deterministic: Fixed seeds (42) for reproducibility - Safe: Validation, error detection, timeout protection (30s/workload) - Observable: Progress reporting, .gcda verification (33 files generated) Workload Coverage: - Random mixed: 3 working set sizes (128/256/512 slots) - Tiny hot: 2 size classes (16B/64B) - Total: 5 workloads covering hot/cold paths Documentation: - PHASE4_STEP1_COMPLETE.md - Completion report - CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓) - docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00
**Choose your next phase**:
- **Option A**: Continue Phase 7 (Steps 5-7, final cleanup)
- **Option B**: PGO re-enablement (recommended for normal builds)
- **Option C**: BenchFast pool expansion (recommended for ceiling measurement)
- **Option D**: Production readiness & benchmarking
Phase 4-Step1: Add PGO workflow automation (+6.25% performance) Implemented automated Profile-Guided Optimization workflow using Box pattern: Performance Improvement: - Baseline: 57.0 M ops/s - PGO-optimized: 60.6 M ops/s - Gain: +6.25% (within expected +5-10% range) Implementation: 1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads 2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection 3. Makefile PGO targets: - pgo-tiny-profile: Build instrumented binaries - pgo-tiny-collect: Collect .gcda profile data - pgo-tiny-build: Build optimized binaries - pgo-tiny-full: Complete workflow (profile → collect → build → test) 4. Makefile help target: Added PGO instructions for discoverability Design: - Box化: Single responsibility, clear contracts - Deterministic: Fixed seeds (42) for reproducibility - Safe: Validation, error detection, timeout protection (30s/workload) - Observable: Progress reporting, .gcda verification (33 files generated) Workload Coverage: - Random mixed: 3 working set sizes (128/256/512 slots) - Tiny hot: 2 size classes (16B/64B) - Total: 5 workloads covering hot/cold paths Documentation: - PHASE4_STEP1_COMPLETE.md - Completion report - CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓) - docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00
**Or**: Celebrate Phase 8 success! 🎉 (Root cause fixes complete!)
Phase 4-Step1: Add PGO workflow automation (+6.25% performance) Implemented automated Profile-Guided Optimization workflow using Box pattern: Performance Improvement: - Baseline: 57.0 M ops/s - PGO-optimized: 60.6 M ops/s - Gain: +6.25% (within expected +5-10% range) Implementation: 1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads 2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection 3. Makefile PGO targets: - pgo-tiny-profile: Build instrumented binaries - pgo-tiny-collect: Collect .gcda profile data - pgo-tiny-build: Build optimized binaries - pgo-tiny-full: Complete workflow (profile → collect → build → test) 4. Makefile help target: Added PGO instructions for discoverability Design: - Box化: Single responsibility, clear contracts - Deterministic: Fixed seeds (42) for reproducibility - Safe: Validation, error detection, timeout protection (30s/workload) - Observable: Progress reporting, .gcda verification (33 files generated) Workload Coverage: - Random mixed: 3 working set sizes (128/256/512 slots) - Tiny hot: 2 size classes (16B/64B) - Total: 5 workloads covering hot/cold paths Documentation: - PHASE4_STEP1_COMPLETE.md - Completion report - CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓) - docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00
---
Updated: 2025-11-30
Phase: 8 COMPLETE (Root Cause Fixes) → 9 PENDING
Previous: Phase 7 (Tiny Front Unification, +55.5%)
Achievement: BenchFast crash investigation and fixes (箱理論 root cause analysis!)