|
|
f8e7cf05b4
|
Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added
## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%)
Target: Reduce alloc-side fixed costs by adding LEGACY direct path to
FastLane entry, mirroring Phase 9/10 free-side winning pattern.
Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as
research box (default OFF).
Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause:
unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3
only (matching existing dualhot pattern).
Files:
- core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new)
- core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119)
- core/bench_profile.h (env refresh sync)
- Makefile (new obj)
- docs/analysis/PHASE16_*.md (design/results/instructions)
ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in)
Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/
routing optimization ROI is exhausted post-Phase-6 FastLane collapse.
---
## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed
Purpose: Validate "system malloc faster" observation using same-binary
A/B testing to isolate allocator logic差 vs binary layout penalty.
Method:
- Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem)
- System binary: bench_random_mixed_system (21K separate binary)
- Perf stat: Hardware counter analysis (I-cache, cycles, instructions)
Result: **Case B confirmed** — Allocator差 negligible, layout penalty dominates.
Gap breakdown (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s
- libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level)
- system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem
Perf stat (200M iters):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
- Binary size: 653K → 21K (30x difference)
Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >>
algorithmic efficiency.
Conclusion: Phase 12's "system malloc 1.6x faster" was real, but
misattributed. Gap is layout/I-cache, NOT allocator algorithm.
Files:
- docs/analysis/PHASE17_*.md (results/instructions)
- scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned)
Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt)
---
## Phase 18: Hot Text Isolation — Design Added
Purpose: Reduce I-cache misses + instruction footprint via layout control
(binary optimization, not allocator algorithm changes).
Strategy (v1 → v2 progression):
v1 (TU split + hot/cold attrs + optional gc-sections):
- Target: +2% throughput (GO threshold, realistic for layout tweaks)
- Secondary: I-cache -10%, instructions -5% (direction confirmation)
- Risk: Low (reversible via build knob)
- Expected: +0-2% (NEUTRAL likely, but validates approach)
v2 (BENCH_MINIMAL compile-out):
- Target: +10-20% throughput (本命)
- Method: Conditional compilation removes stats/ENV/debug from hot path
- Expected: Instruction count -30-40% → significant I-cache improvement
Files:
- docs/analysis/PHASE18_*.md (design/instructions)
- CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan)
Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob)
Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
2025-12-15 05:25:47 +09:00 |
|
|
|
cbb35ee27f
|
Phase 13 v1 + E5-2 retest: Both NEUTRAL, freeze as research boxes
Phase 13 v1: Header Write Elimination (C7 preserve header)
- Verdict: NEUTRAL (+0.78%)
- Implementation: HAKMEM_TINY_C7_PRESERVE_HEADER ENV gate (default OFF)
- Makes C7 nextptr offset conditional (0→1 when enabled)
- 4-point matrix A/B test results:
* Case A (baseline): 51.49M ops/s
* Case B (WRITE_ONCE=1): 52.07M ops/s (+1.13%)
* Case C (C7_PRESERVE=1): 51.36M ops/s (-0.26%)
* Case D (both): 51.89M ops/s (+0.78% NEUTRAL)
- Action: Freeze as research box (default OFF, manual opt-in)
Phase 5 E5-2: Header Write-Once retest (promotion test)
- Verdict: NEUTRAL (+0.54%)
- Motivation: Phase 13 Case B showed +1.13%, re-tested with dedicated 20-run
- Results (20-run):
* Case A (baseline): 51.10M ops/s
* Case B (WRITE_ONCE=1): 51.37M ops/s (+0.54%)
- Previous test: +0.45% (consistent with NEUTRAL)
- Action: Keep as research box (default OFF, manual opt-in)
Key findings:
- Header write tax optimization shows consistent NEUTRAL results
- Neither Phase 13 v1 nor E5-2 reaches GO threshold (+1.0%)
- Both implemented as reversible ENV gates for future research
Files changed:
- New: core/box/tiny_c7_preserve_header_env_box.{c,h}
- Modified: core/box/tiny_layout_box.h (C7 offset conditional)
- Modified: core/tiny_nextptr.h, core/box/tiny_header_box.h (comments)
- Modified: core/bench_profile.h (refresh sync)
- Modified: Makefile (add new .o files)
- Modified: scripts/run_mixed_10_cleanenv.sh (add C7_PRESERVE ENV)
- Docs: PHASE13_*, PHASE5_E5_2_HEADER_WRITE_ONCE_* (design/results)
Next: Phase 14 (Pointer-chase reduction, tcache-style intrusive LIFO)
🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
2025-12-15 00:32:25 +09:00 |
|
|
|
71b1354d32
|
Phase 10: FREE-TINY-FAST MONO LEGACY DIRECT (GO +1.89%)
Results:
- A/B test: +1.89% on Mixed (10-run, clean env)
- Baseline: 51.96M ops/s
- Optimized: 52.94M ops/s
- Improvement: +984K ops/s (+1.89%)
- C6-heavy verification: +7.86% (nonlegacy_mask works correctly, no misfires)
Strategy:
- Extend Phase 9 (C0-C3 DUALHOT) to C4-C7 LEGACY DIRECT
- Fail-Fast principle: Never misclassify MID/ULTRA/V7 as LEGACY
- nonlegacy_mask: Cached at init, hot path uses single bit operation
Success factors:
1. Performance improvement: +1.89% (1.9x GO threshold)
2. Safety verified: nonlegacy_mask prevents MID v3 misfire in C6-heavy
3. Phase 9 coexistence: C0-C3 (Phase 9) + C4-C7 (Phase 10) = full LEGACY coverage
4. Minimal overhead: Single bit operation in hot path (mask & (1u<<class))
Implementation:
- Patch 1: ENV gate box (free_tiny_fast_mono_legacy_direct_env_box.h)
- ENV: HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=0/1 (default 0)
- nonlegacy_mask cached (reuses free_policy_fast_v2_nonlegacy_mask())
- Probe window: 64 (avoid bench_profile putenv race)
- Patch 2: Early-exit in free_tiny_fast() (malloc_tiny_fast.h)
- Conditions: !nonlegacy_mask, route==LEGACY, !LARSON_FIX, done==1
- Direct call: tiny_legacy_fallback_free_base()
- Patch 3: Visibility (free_path_stats_box.h)
- mono_legacy_direct_hit counter (compile-out in release)
- Patch 4: cleanenv extension (run_mixed_10_cleanenv.sh)
- ENV leak protection
Safety verification (C6-heavy):
- OFF: 19.75M ops/s
- ON: 21.30M ops/s (+7.86%)
- nonlegacy_mask correctly excludes C6 (MID v3 active)
- Improvement from C0-C5, C7 direct path acceleration
Files modified:
- core/bench_profile.h: add to MIXED_TINYV3_C7_SAFE preset
- core/front/malloc_tiny_fast.h: early-exit insertion
- core/box/free_path_stats_box.h: counter
- core/box/free_tiny_fast_mono_legacy_direct_env_box.h: NEW (ENV gate + nonlegacy_mask)
- scripts/run_mixed_10_cleanenv.sh: ENV leak protection
Health check: PASSED (all profiles)
Promotion: Added to MIXED_TINYV3_C7_SAFE preset (default ON, opt-out)
Rollback: HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=0
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
2025-12-14 20:09:40 +09:00 |
|
|
|
871034da1f
|
Phase 9: FREE-TINY-FAST MONO DUALHOT (GO +2.72%)
Results:
- A/B test: +2.72% on Mixed (10-run, clean env)
- Baseline: 48.89M ops/s
- Optimized: 50.22M ops/s
- Improvement: +1.33M ops/s (+2.72%)
- Stability: Standard deviation reduced by 60.8% (2.44M → 955K ops/s)
Strategy:
- Transplant C0-C3 "second hot" path to monolithic free_tiny_fast()
- Early-exit within monolithic (no hot/cold split)
- FastLane free now benefits from C0-C3 direct path
Success factors:
1. Performance improvement: +2.72% (2.7x GO threshold)
2. Stability improvement: 2.6x more stable (stdev 60.8% reduction)
3. Learned from Phase 7 failure:
- Phase 7: Function split (hot/cold) → NO-GO
- Phase 9: Early-exit within monolithic → GO
4. FastLane free compatibility: C0-C3 direct path now works with FastLane
5. Policy snapshot overhead reduction: C0-C3 (48% of Mixed) skip route lookup
Implementation:
- Patch 1: ENV gate box (free_tiny_fast_mono_dualhot_env_box.h)
- ENV: HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=0/1 (default 0)
- Probe window: 64 (avoid bench_profile putenv race)
- Patch 2: Early-exit in free_tiny_fast() (malloc_tiny_fast.h)
- Conditions: class_idx <= 3, !LARSON_FIX, route==LEGACY
- Direct call: tiny_legacy_fallback_free_base()
- Patch 3: Visibility (free_path_stats_box.h)
- mono_dualhot_hit counter (compile-out in release)
- Patch 4: cleanenv extension (run_mixed_10_cleanenv.sh)
- ENV leak protection
Files modified:
- core/bench_profile.h: add to MIXED_TINYV3_C7_SAFE preset
- core/front/malloc_tiny_fast.h: early-exit insertion
- core/box/free_path_stats_box.h: counter
- core/box/free_tiny_fast_mono_dualhot_env_box.h: NEW (ENV gate)
- scripts/run_mixed_10_cleanenv.sh: ENV leak protection
Health check: PASSED (all profiles)
Promotion: Added to MIXED_TINYV3_C7_SAFE preset (default ON, opt-out)
Rollback: HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=0
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
2025-12-14 19:16:49 +09:00 |
|