Fix CollectionsHot type_table to resolve copy chains when propagating types through PHI nodes, enabling ArrayBox optimization in deeply nested loops (matmul).

**Root Cause:**
- PHI incoming values were checked directly in type_table
- Copy chains (e.g., SSA 56→11) were not resolved
- Types failed to propagate through deep loop nesting

**Solution:**
- Add copy_src_map parameter to build_type_table()
- Resolve copy chains before checking if PHI incoming values are typed
- Multi-pass fixpoint algorithm (up to 12 passes) ensures convergence

**Impact:**
- matmul: ArrayBox A/B/C now propagate through nested loops
- Expected: boxcall→externcall conversion for get/set/push
- Backwards compatible: opt-in (NYASH_AOT_COLLECTIONS_HOT=1), no behavior change when disabled

**Analysis:**
- Documented in docs/development/analysis/matmul_collections_hot_fix.md
- Box→SSA flow traced: A(dst=2)→...→71(get), B(dst=3)→...→73(get), C(dst=4)→...→105(set)

**Testing:** Pending verification due to VM step budget constraints

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
CollectionsHot Fix for matmul Pattern
Problem Summary
Issue: The CollectionsHot pass was not rewriting ArrayBox get/set/push operations to externcalls in the matmul benchmark, leaving matmul at roughly 200,000% of the C runtime.
Root Cause: PHI propagation in build_type_table() failed to resolve copy chains before checking if incoming PHI values were typed, causing type propagation to stop prematurely in deeply nested loop structures.
Analysis Results (Task 1)
Box→SSA Mapping for matmul
Test case: /tmp/matmul_min.hako with n=512, reps=8, A/B/C as new ArrayBox()
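For reference, below is a rough Python analogue of the access pattern under analysis. It is not the actual /tmp/matmul_min.hako source (the fill values and variable names are illustrative); it only mirrors the push-to-initialize and get/set-in-nested-loops idiom that CollectionsHot needs to lower.

```python
# Illustrative sketch only: a Python analogue of the matmul pattern.
# The real benchmark is written in .hako and uses `new ArrayBox()` with
# push/get/set boxcalls, which CollectionsHot should rewrite to externcalls.
def matmul(n: int, reps: int) -> float:
    A, B, C = [], [], []                 # stand-ins for `new ArrayBox()`
    for i in range(n * n):               # initialization via push (Block 2)
        A.append(float(i % 7))           # fill values are invented for the sketch
        B.append(float(i % 5))
        C.append(0.0)
    for _ in range(reps):
        for i in range(n):
            for j in range(n):
                acc = 0.0
                for k in range(n):
                    # A.get / B.get in the innermost loop (SSA 71 / 73 below)
                    acc += A[i * n + k] * B[k * n + j]
                # C.set in the loop body (SSA 105 below)
                C[i * n + j] = acc
    return C[0]
```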
ArrayBox A (newbox dst=2):
- Value flow: [2, 10, 11, 39, 40, 55, 56, 70, 71, 100, 101]
- Boxcalls using A: 2
  - Block 2: box=11.push([17])
  - Block 11: box=71.get([86])
ArrayBox B (newbox dst=3):
- Value flow: [3, 12, 13, 41, 42, 57, 58, 72, 73, 102, 103]
- Boxcalls using B: 2
  - Block 2: box=13.push([25])
  - Block 11: box=73.get([89])
ArrayBox C (newbox dst=4):
- Value flow: [4, 14, 15, 43, 44, 59, 60, 74, 75, 104, 105]
- Boxcalls using C: 3
  - Block 2: box=15.push([30])
  - Block 6: box=44.get([135])
  - Block 15: box=105.set([124, 107])
Type Table Propagation Issue
Before fix: type_table stopped at SSA 43
- Final size: 12 entries
- Entries: {2: 'arr', 3: 'arr', 4: 'arr', 10: 'arr', 11: 'arr', 12: 'arr', 13: 'arr', 14: 'arr', 15: 'arr', 39: 'arr', 41: 'arr', 43: 'arr'}
Problem: Boxcalls used SSA 71, 73, 105 which were NOT in type_table
Tracing SSA 71 back to its origin:
71 -> 70 -> 56 -> 55 -> 40 -> 39 -> 11 -> 10 -> 2 (newbox ArrayBox)
Blocking PHI: SSA 40
{
"dst": 40,
"incoming": [[39, 3], [56, 9]],
"op": "phi"
}
- SSA 39: in type_table ✓
- SSA 56: NOT in type_table ✗ (but resolves via copy to SSA 11 which IS typed)
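The missing step is exactly this copy resolution. Below is a minimal sketch of the idea in Python, not the actual .hako helper; the dst → src shape of copy_src_map is assumed from the copy_src usage described in the solution, and the real pass works on SSA ids read from the MIR JSON text.

```python
def resolve_copy(copy_src_map, vid):
    """Follow dst -> src copy links until a non-copy definition is reached."""
    seen = set()
    while vid in copy_src_map and vid not in seen:
        seen.add(vid)                     # guard against malformed cycles
        vid = copy_src_map[vid]
    return vid

# Data from the trace above: SSA 56 is a copy of SSA 11.
copy_src_map = {56: 11}
type_table = {39: "arr", 11: "arr"}

incoming = [39, 56]                       # incoming values of PHI dst=40
resolved = [resolve_copy(copy_src_map, v) for v in incoming]   # -> [39, 11]
if all(v in type_table for v in resolved):
    type_table[40] = "arr"                # SSA 40 now becomes typed
```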
Solution (Task 2)
Changes to collections_hot.hako
1. Updated build_type_table signature (line 23):
// OLD:
build_type_table(text) {
// NEW:
build_type_table(text, copy_src_map) {
2. Added copy resolution in PHI propagation (lines 62-66):
// NEW CODE:
local vid = StringHelpers.read_digits(body, posb+1)
// Resolve copy chains before checking type table
if vid != "" && copy_src_map != null {
    vid = AotPrepCollectionsHotBox.resolve_copy(copy_src_map, vid)
}
if vid != "" && tmap.has(vid) {
    // ... rest of logic
3. Updated call site to pass copy_src (line 485):
// OLD:
local type_table = AotPrepCollectionsHotBox.build_type_table(out)
// NEW:
// Pass copy_src to enable PHI propagation through copy chains (matmul fix)
local type_table = AotPrepCollectionsHotBox.build_type_table(out, copy_src)
Why This Works
- copy_src map is already built before calling build_type_table (lines 448-460)
- PHI propagation now resolves copy chains: when checking PHI incoming value 56, it resolves to 11, which IS in type_table
- Cascading effect: Once SSA 40 gets typed, subsequent PHIs (56, 70, 71) get typed in later passes
- Fixpoint convergence: the multi-pass algorithm (up to 12 passes) ensures type propagation converges even through deep loop nesting (see the sketch below)
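A condensed sketch of the propagation loop, again in Python rather than .hako, modelling only the PHI part of build_type_table. The real pass scans MIR JSON text; here the PHI records are assumed to be pre-parsed dicts, and resolve_copy is the helper from the sketch above.

```python
MAX_PASSES = 12  # matches the pass budget mentioned above

def build_type_table(phis, seed_types, copy_src_map=None):
    """Fixpoint propagation of collection types through PHI nodes.

    phis          : list of {"dst": id, "incoming": [[value, block], ...]} records
    seed_types    : {ssa_id: "arr"} seeded from newbox ArrayBox sites
    copy_src_map  : optional {dst: src} map of copy instructions
    """
    tmap = dict(seed_types)
    for _ in range(MAX_PASSES):
        changed = False
        for phi in phis:
            dst = phi["dst"]
            if dst in tmap:
                continue
            incoming = [v for v, _blk in phi["incoming"]]
            if copy_src_map is not None:
                incoming = [resolve_copy(copy_src_map, v) for v in incoming]
            kinds = {tmap.get(v) for v in incoming}
            if None not in kinds and len(kinds) == 1:
                tmap[dst] = kinds.pop()   # all incoming edges agree -> propagate
                changed = True
        if not changed:
            break                         # fixpoint reached before the pass budget
    return tmap

# With the matmul data: PHI dst=40 becomes typed once copy 56 -> 11 is resolved.
phis = [{"dst": 40, "incoming": [[39, 3], [56, 9]], "op": "phi"}]
print(build_type_table(phis, {39: "arr", 11: "arr"}, {56: 11}))
```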
Expected Results (Tasks 3-5)
Task 3: Verification
- PREP MIR should show "op":"externcall" with nyash.array.get_h, nyash.array.set_h, nyash.array.push_h
- Original PREP had 0 externcalls, 1+ boxcalls
- Fixed PREP should have many externcalls, fewer boxcalls
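A quick way to check this, assuming (as the rg commands below also do) that the hook names and the "op":"externcall" / "op":"boxcall" strings appear literally in the dumped MIR JSON; the script name is hypothetical:

```python
# quick_count.py -- illustrative substring counter for a dumped MIR JSON
import sys

path = sys.argv[1] if len(sys.argv) > 1 else "tmp/matmul_prep.json"
text = open(path).read()
for needle in ('"op":"externcall"', '"op":"boxcall"',
               "nyash.array.get_h", "nyash.array.set_h", "nyash.array.push_h"):
    print(f"{needle}: {text.count(needle)}")
```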
Task 4: Performance Target
- Current: C ≈ 12ms, Hakorune EXE ≈ 25s, ratio ≈ 200,000%
- Target: ratio < 1000% (ideally < 500% after subsequent hoist/CSE passes)
- Stretch goal: ratio ≈ 125% (matching maplin's C-equivalent performance)
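For scale: 25 s / 12 ms ≈ 2,083x, i.e., roughly 208,000%, which is where the ≈200,000% figure comes from; the <1000% target corresponds to an EXE time under roughly 120 ms, the <500% goal to under ~60 ms, and the 125% stretch goal to about 15 ms.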
Task 5: Regression Checks
Must verify no degradation in:
- maplin: Currently at ~100% ratio (C-equivalent)
- arraymap: Existing optimizations preserved
- linidx: Existing optimizations preserved
Testing Commands
# Generate ORIG MIR (no AOT prep)
NYASH_SKIP_TOML_ENV=1 NYASH_DISABLE_PLUGINS=1 NYASH_PARSER_STAGE3=1 \
./target/release/hakorune --emit-mir-json tmp/matmul_orig.json /tmp/matmul_min.hako
# Generate PREP MIR (with CollectionsHot fix)
# Note: Need to use tools/hakorune_emit_mir.sh or manually apply AotPrep.run_json
# Count externcalls vs boxcalls
rg '"op":"externcall"' tmp/matmul_prep.json | wc -l
rg '"op":"boxcall"' tmp/matmul_prep.json | wc -l
# Run benchmark
NYASH_SKIP_TOML_ENV=1 NYASH_DISABLE_PLUGINS=1 \
tools/perf/microbench.sh --case matmul --backend llvm --exe --runs 3
# Regression checks
tools/perf/microbench.sh --case maplin --backend llvm --exe --runs 3
tools/perf/microbench.sh --case arraymap --backend llvm --exe --runs 3 --budget-ms 10000
tools/perf/microbench.sh --case linidx --backend llvm --exe --runs 3
Implementation Status
- ✅ Task 1: Analysis complete
- ✅ Task 2: Code changes complete (collections_hot.hako modified)
- ⏳ Task 3: Testing pending (VM step budget issues with large MIR)
- ⏳ Task 4: Performance benchmark pending
- ⏳ Task 5: Regression checks pending
Notes
- Phase 15.5 compliance: the fix follows the "既定OFF" (default-off) principle: it only takes effect when NYASH_AOT_COLLECTIONS_HOT=1 is set
- Backwards compatibility: the copy_src_map parameter is optional (can be null), so existing callers still work
- No behavior change: existing maplin/arraymap/linidx should be unaffected, as they don't have deep PHI chains requiring copy resolution