Phase 143: Canonicalizer Adaptation Range Expansion
Status
- State: 🎉 Complete (P0)
P0: parse_number Pattern - Break in THEN Clause
Objective
Expand the canonicalizer to recognize parse_number/digit collection patterns, maximizing the adaptation range before adding new lowering patterns.
Target Pattern
tools/selfhost/test_pattern2_parse_number.hako
loop(i < num_str.length()) {
    local ch = num_str.substring(i, i + 1)
    local digit_pos = digits.indexOf(ch)
    // Exit on non-digit (break in THEN clause)
    if digit_pos < 0 {
        break
    }
    // Append digit
    result = result + ch
    i = i + 1
}
Pattern Characteristics
Key Difference from skip_whitespace:
- skip_whitespace: if cond { update } else { break } (break in ELSE clause)
- parse_number: if invalid_cond { break } body... update (break in THEN clause)
Structure:
loop(cond) {
    // ... body statements (ch, digit_pos computation)
    if invalid_cond {
        break
    }
    // ... rest statements (result append, carrier update)
    carrier = carrier + const
}
Implementation Summary
1. New Recognizer (ast_feature_extractor.rs)
Added detect_parse_number_pattern():
- Detects if cond { break } pattern (no else clause)
- Extracts body statements before break check
- Extracts rest statements after break check (including carrier update)
- Returns ParseNumberInfo { carrier_name, delta, body_stmts, rest_stmts }
Lines added: ~150 lines
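The detection steps listed above can be sketched on a toy AST. This is a minimal illustration, not the project's code: the `Stmt` enum is hypothetical, the real recognizer works on the project's own AST nodes, and the assumption that the carrier update is the last statement of the loop body is made here only to keep the sketch short.

```rust
/// Toy AST (hypothetical); the real recognizer inspects the project's own AST nodes.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Stmt {
    /// Any statement the recognizer does not inspect, e.g. `local ch = ...`.
    Other(String),
    /// `name = name + delta` with a constant delta.
    AddConst { name: String, delta: i64 },
    /// `if cond { then_body } else { else_body }`; `else_body` is None when absent.
    If { cond: String, then_body: Vec<Stmt>, else_body: Option<Vec<Stmt>> },
    Break,
}

#[derive(Debug, PartialEq)]
struct ParseNumberInfo {
    carrier_name: String,
    delta: i64,
    body_stmts: Vec<Stmt>,
    rest_stmts: Vec<Stmt>,
}

fn detect_parse_number_pattern(body: &[Stmt]) -> Option<ParseNumberInfo> {
    // Find the first `if cond { break }` that has no else clause.
    let idx = body.iter().position(|s| {
        matches!(
            s,
            Stmt::If { then_body, else_body: None, .. }
                if then_body.len() == 1 && then_body[0] == Stmt::Break
        )
    })?;

    // Everything after the break check, including the carrier update.
    let rest = &body[idx + 1..];

    // Assumption of this sketch: the carrier update is the last statement.
    let (carrier_name, delta) = match rest.last()? {
        Stmt::AddConst { name, delta } => (name.clone(), *delta),
        _ => return None,
    };

    Some(ParseNumberInfo {
        carrier_name,
        delta,
        body_stmts: body[..idx].to_vec(),
        rest_stmts: rest.to_vec(),
    })
}
```

In this sketch a skip_whitespace-shaped loop (break in the ELSE clause) returns `None`, because the `else_body: None` requirement fails. That is what lets a canonicalizer try this recognizer first and fall through to the next one.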
2. Canonicalizer Integration (canonicalizer.rs)
- Tries parse_number pattern before skip_whitespace pattern
- Builds LoopSkeleton with:
  - Step 1: HeaderCond
  - Step 2: Body (statements before break)
  - Step 3: Body (statements after break, excluding carrier update)
  - Step 4: Update (carrier update)
- Routes to Pattern2Break (has_break=true)
Lines modified: ~60 lines
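As a rough illustration of the four-step skeleton, the sketch below assembles the steps from the recognizer output. `SkeletonStep` and `build_parse_number_skeleton` are hypothetical names, and statements are plain strings for brevity; the actual LoopSkeleton type in canonicalizer.rs may be shaped differently.

```rust
/// Hypothetical skeleton step type; statements are plain strings for brevity.
#[derive(Debug, Clone, PartialEq)]
enum SkeletonStep {
    HeaderCond(String),
    Body(Vec<String>),
    Update { carrier: String, delta: i64 },
}

/// Builds the 4-step skeleton described above.
/// `rest_stmts` is assumed to end with the carrier update, which is emitted
/// as its own Update step rather than inside the second Body step.
fn build_parse_number_skeleton(
    header_cond: &str,
    body_stmts: &[String],
    rest_stmts: &[String],
    carrier: &str,
    delta: i64,
) -> Vec<SkeletonStep> {
    // Drop the trailing carrier update so it is not emitted twice.
    let rest_without_update = &rest_stmts[..rest_stmts.len().saturating_sub(1)];
    vec![
        SkeletonStep::HeaderCond(header_cond.to_string()),
        SkeletonStep::Body(body_stmts.to_vec()),
        SkeletonStep::Body(rest_without_update.to_vec()),
        SkeletonStep::Update { carrier: carrier.to_string(), delta },
    ]
}
```

Keeping the carrier update out of the second Body step is the point of the "excluding carrier update" note: the update has to appear exactly once in the skeleton.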
3. Export Chain
Added exports through the module hierarchy:
- ast_feature_extractor.rs → ParseNumberInfo struct
- patterns/mod.rs → re-export
- joinir/mod.rs → re-export
- control_flow/mod.rs → re-export
- builder.rs → re-export
- mir/mod.rs → final re-export
Files modified: 6 files (8 lines total)
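The re-export chain can be pictured with nested inline modules. This is only a sketch: the module nesting follows the path src/mir/builder/control_flow/joinir/patterns/ast_feature_extractor.rs, the struct is reduced to two fields, and the real files may re-export through grouped `pub use` lists instead of one line per item.

```rust
// Inline modules stand in for the real files; one `pub use` per level.
mod mir {
    pub mod builder {
        pub mod control_flow {
            pub mod joinir {
                pub mod patterns {
                    pub mod ast_feature_extractor {
                        /// Reduced to two fields for this sketch.
                        #[derive(Debug, Default, PartialEq)]
                        pub struct ParseNumberInfo {
                            pub carrier_name: String,
                            pub delta: i64,
                        }
                    }
                    // patterns/mod.rs
                    pub use self::ast_feature_extractor::ParseNumberInfo;
                }
                // joinir/mod.rs
                pub use self::patterns::ParseNumberInfo;
            }
            // control_flow/mod.rs
            pub use self::joinir::ParseNumberInfo;
        }
        // builder.rs
        pub use self::control_flow::ParseNumberInfo;
    }
    // mir/mod.rs (final re-export)
    pub use self::builder::ParseNumberInfo;
}
```

With the chain in place, code outside `mir` can name the type as `mir::ParseNumberInfo` without knowing where it is defined.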
4. Unit Tests
Added test_parse_number_pattern_recognized() in canonicalizer.rs:
- Builds AST for parse_number pattern
- Verifies skeleton structure (4 steps)
- Verifies carrier (name="i", delta=1, role=Counter)
- Verifies exit contract (has_break=true)
- Verifies routing decision (Pattern2Break, no missing_caps)
Lines added: ~130 lines
Acceptance Criteria
- ✅ Canonicalizer creates Skeleton for parse_number loop
- ✅ RoutingDecision.chosen matches router (Pattern2Break)
- ✅ Strict parity OK (canonicalizer and router agree)
- ✅ Default behavior unchanged
- ✅ quick profile not affected
- ✅ Unit test added
- ✅ Documentation created
Results
Parity Verification
NYASH_JOINIR_DEV=1 HAKO_JOINIR_STRICT=1 ./target/release/hakorune \
tools/selfhost/test_pattern2_parse_number.hako
Output:
[loop_canonicalizer] Chosen pattern: Pattern2Break
[choose_pattern_kind/PARITY] OK: canonical and actual agree on Pattern2Break
[loop_canonicalizer/PARITY] OK in function 'main': canonical and actual agree on Pattern2Break
Status: ✅ Green parity - canonicalizer and router agree
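The parity check compares the pattern chosen by the canonicalizer with the one chosen by the existing router. A minimal sketch follows; the OK message mirrors the log line shown above, while the MISMATCH wording and the rule that strict mode turns a mismatch into an error are assumptions of this sketch, not confirmed behaviour of the tool.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LoopPatternKind {
    Pattern2Break,
    Pattern4Continue,
}

/// Returns the log line on agreement.
/// On disagreement: Err in strict mode, Ok with a warning line otherwise (assumed).
fn check_parity(
    func: &str,
    canonical: LoopPatternKind,
    actual: LoopPatternKind,
    strict: bool,
) -> Result<String, String> {
    if canonical == actual {
        Ok(format!(
            "[loop_canonicalizer/PARITY] OK in function '{func}': canonical and actual agree on {canonical:?}"
        ))
    } else {
        // Hypothetical message format for the mismatch case.
        let msg = format!(
            "[loop_canonicalizer/PARITY] MISMATCH in function '{func}': canonical={canonical:?}, actual={actual:?}"
        );
        if strict { Err(msg) } else { Ok(msg) }
    }
}
```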
Unit Test Results
cargo test --release --lib loop_canonicalizer::canonicalizer::tests::test_parse_number_pattern_recognized
Status: ✅ PASS
Statistics
| Metric | Count |
|---|---|
| New patterns supported | 1 (parse_number) |
| Total patterns supported | 3 (skip_whitespace, parse_number, continue) |
| New Capability Tags | 0 (uses existing ConstStep) |
| Lines added | ~280 |
| Files modified | 8 |
| Unit tests added | 1 |
| Parity status | Green ✅ |
Comparison: Parse Number vs Skip Whitespace
| Aspect | Skip Whitespace | Parse Number |
|---|---|---|
| Break location | ELSE clause | THEN clause |
| Pattern | if cond { update } else { break } | if invalid { break } rest... update |
| Body before if | Optional | Optional (ch, digit_pos) |
| Body after if | None (last statement) | Required (result append) |
| Carrier update | In THEN clause | After if statement |
| Routing | Pattern2Break | Pattern2Break |
| Example | skip_whitespace, trim_leading/trailing | parse_number, digit collection |
Follow-up Opportunities
Immediate (Phase 143 P1-P2)
- Support parse_string pattern (continue + return combo)
- Add capability for variable-step updates (escape handling)
Future Enhancements
- Extend recognizer for nested if patterns
- Support multiple break points (requires new capability)
- Add signature-based corpus analysis
Lessons Learned
- Break location matters: THEN vs ELSE clause creates different patterns
- rest_stmts extraction: Need to carefully separate body from carrier update
- Export chain: Requires 6-level re-export (ast → patterns → joinir → control_flow → builder → mir)
- Parity first: Always verify strict parity before claiming success
SSOT
- Design: docs/development/current/main/design/loop-canonicalizer.md
- Recognizer: src/mir/builder/control_flow/joinir/patterns/ast_feature_extractor.rs
- Canonicalizer: src/mir/loop_canonicalizer/canonicalizer.rs
- Tests: test file tools/selfhost/test_pattern2_parse_number.hako
P1: parse_string Pattern - Continue + Return Combo
Status
✅ Complete (2025-12-16)
Objective
Expand canonicalizer to recognize parse_string patterns with both continue (escape handling) and return (quote found).
Target Pattern
tools/selfhost/test_pattern4_parse_string.hako
loop(p < len) {
    local ch = s.substring(p, p + 1)
    // Check for closing quote (return)
    if ch == "\"" {
        return 0
    }
    // Check for escape sequence (continue)
    if ch == "\\" {
        result = result + ch
        p = p + 1
        if p < len { // Nested if
            result = result + s.substring(p, p + 1)
            p = p + 1
            continue // Nested continue
        }
    }
    // Regular character
    result = result + ch
    p = p + 1
}
Pattern Characteristics
Key Features:
- Multiple exit types: both return and continue
- Nested control flow: continue is inside a nested if
- Variable step updates: p++ normally, but p += 2 on escape
Structure:
loop(cond) {
    // ... body statements (ch computation)
    if quote_cond {
        return result
    }
    if escape_cond {
        // ... escape handling
        carrier = carrier + step
        if nested_cond {
            // ... nested handling
            carrier = carrier + step
            continue // Nested continue!
        }
    }
    // ... regular processing
    carrier = carrier + step
}
Implementation Summary
1. New Recognizer (ast_feature_extractor.rs)
Added detect_parse_string_pattern():
- Detects if cond { return } pattern
- Detects continue statement (with recursive search for nested continue)
- Uses has_continue_node() helper for deep search
- Returns ParseStringInfo { carrier_name, delta, body_stmts }
Lines added: ~120 lines
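To show why a deep search is needed, the sketch below contrasts a shallow scan with a recursive one on a toy AST. The `Node` enum is hypothetical. The function name `has_continue_node` is borrowed from the helper mentioned above, but its real signature and implementation are not shown in this document, so this body is an assumption.

```rust
/// Toy AST (hypothetical); no nested loops, so every `continue` belongs to the outer loop.
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Node {
    Continue,
    Return,
    Other,
    If { then_body: Vec<Node>, else_body: Vec<Node> },
}

/// Shallow scan: only sees a `continue` at the top level of the loop body.
fn has_continue_shallow(body: &[Node]) -> bool {
    body.iter().any(|n| matches!(n, Node::Continue))
}

/// Deep scan: also descends into both branches of every `if`.
fn has_continue_node(body: &[Node]) -> bool {
    body.iter().any(|n| match n {
        Node::Continue => true,
        Node::If { then_body, else_body } => {
            has_continue_node(then_body) || has_continue_node(else_body)
        }
        _ => false,
    })
}
```

For the parse_string loop, where `continue` sits two `if` levels deep, the shallow scan reports no continue while the deep scan finds it. A real implementation would also have to decide whether to descend into nested loops, since a `continue` there targets the inner loop.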
2. Canonicalizer Integration (canonicalizer.rs)
- Tries parse_string pattern first (most specific)
- Builds LoopSkeleton with:
  - Step 1: HeaderCond
  - Step 2: Body (statements before exit checks)
  - Step 3: Update (carrier update)
- Sets ExitContract:
  - has_break = false
  - has_continue = true
  - has_return = true
- Routes to Pattern4Continue (has both continue and return)
Lines modified: ~45 lines
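The relationship between the exit contract and the routing target can be sketched as a small function. The field names come from the contract above. The routing rule is an assumption that only covers the cases described in this document (break-only loops go to Pattern2Break, continue loops without break go to Pattern4Continue, and has_return does not change the target); the real router may apply further checks.

```rust
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
struct ExitContract {
    has_break: bool,
    has_continue: bool,
    has_return: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LoopPatternKind {
    Pattern2Break,
    Pattern4Continue,
}

/// Assumed routing rule, covering only the combinations discussed in this document.
/// Combinations not covered here (e.g. break + continue) yield None.
fn choose_pattern(exits: ExitContract) -> Option<LoopPatternKind> {
    match (exits.has_break, exits.has_continue) {
        (true, false) => Some(LoopPatternKind::Pattern2Break),
        (false, true) => Some(LoopPatternKind::Pattern4Continue),
        _ => None,
    }
}
```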
3. Export Chain
Added exports through the module hierarchy:
- ast_feature_extractor.rs → ParseStringInfo struct + detect_parse_string_pattern()
- patterns/mod.rs → re-export
- joinir/mod.rs → re-export
- control_flow/mod.rs → re-export
- builder.rs → re-export
- mir/mod.rs → final re-export
Files modified: 7 files (10 lines total)
4. Unit Tests
Added test_parse_string_pattern_recognized() in canonicalizer.rs:
- Builds AST for parse_string pattern
- Verifies skeleton structure (3 steps minimum)
- Verifies carrier (name="p", delta=1, role=Counter)
- Verifies exit contract (has_continue=true, has_return=true, has_break=false)
- Verifies routing decision (Pattern4Continue, no missing_caps)
Lines added: ~180 lines
Acceptance Criteria
- ✅ Canonicalizer creates Skeleton for parse_string loop
- ✅ RoutingDecision.chosen matches router (Pattern4Continue)
- ✅ Strict parity green (canonicalizer and router agree)
- ✅ Default behavior unchanged
- ✅ quick profile not affected (unrelated smoke test failure)
- ✅ Unit test added and passing
- ✅ Nested continue detection implemented
Results
Parity Verification
NYASH_JOINIR_DEV=1 HAKO_JOINIR_STRICT=1 ./target/release/hakorune \
tools/selfhost/test_pattern4_parse_string.hako
Output:
[loop_canonicalizer] Skeleton steps: 3
[loop_canonicalizer] Carriers: 1
[loop_canonicalizer] Has exits: true
[loop_canonicalizer] Decision: SUCCESS
[loop_canonicalizer] Chosen pattern: Pattern4Continue
[loop_canonicalizer] Missing caps: []
[loop_canonicalizer/PARITY] OK in function 'main': canonical and actual agree on Pattern4Continue
Status: ✅ Green parity - canonicalizer and router agree on Pattern4Continue
Unit Test Results
cargo test --release --lib loop_canonicalizer
Status: ✅ All 19 tests PASS
Statistics
| Metric | Count |
|---|---|
| New patterns supported | 1 (parse_string) |
| Total patterns supported | 4 (skip_whitespace, parse_number, continue, parse_string) |
| New Capability Tags | 0 (uses existing ConstStep) |
| Lines added | ~300 |
| Files modified | 9 |
| Unit tests added | 1 |
| Parity status | Green ✅ |
Technical Challenges
- Nested Continue Detection: Required using has_continue_node() recursive helper instead of shallow iteration
- Complex Exit Contract: First pattern with both has_continue=true AND has_return=true
- Variable Step Updates: The actual loop has variable steps (p++ vs p+=2), but canonicalizer uses base delta=1
Comparison: Parse String vs Other Patterns
| Aspect | Skip Whitespace | Parse Number | Continue | Parse String |
|---|---|---|---|---|
| Break | Yes (ELSE) | Yes (THEN) | No | No |
| Continue | No | No | Yes | Yes |
| Return | No | No | No | Yes |
| Nested control | No | No | No | Yes (nested if + continue) |
| Routing | Pattern2Break | Pattern2Break | Pattern4Continue | Pattern4Continue |
Follow-up Opportunities
Next Steps (Phase 143 P2-P3)
- Support parse_array pattern (array element collection)
- Support parse_object pattern (key-value pair collection)
- Add capability for true variable-step updates (not just const delta)
Future Enhancements
- Support multiple return points
- Handle more complex nested patterns
- Add signature-based corpus analysis for pattern discovery
Lessons Learned
- Nested Detection Required: Simple shallow iteration isn't enough for real-world patterns
- ExitContract Diversity: Patterns can have multiple exit types simultaneously
- Parity vs Execution: Achieving parity doesn't guarantee runtime success (Pattern4 lowering may need enhancements)
- Recursive Helpers: Reusing existing helpers (has_continue_node) is better than duplicating logic
P2: parse_array Pattern - Separator + Stop Combo
Status
✅ Complete (2025-12-16)
Objective
Extend canonicalizer to recognize parse_array patterns with both continue (separator handling) and return (stop condition).
Target Pattern
tools/selfhost/test_pattern4_parse_array.hako
loop(p < len) {
    local ch = s.substring(p, p + 1)
    // Check for array end (return)
    if ch == "]" {
        if elem.length() > 0 {
            arr.push(elem)
        }
        return 0
    }
    // Check for separator (continue)
    if ch == "," {
        if elem.length() > 0 {
            arr.push(elem)
            elem = ""
        }
        p = p + 1
        continue
    }
    // Accumulate element
    elem = elem + ch
    p = p + 1
}
Pattern Characteristics
Key Features:
- Multiple exit types: both return (stop condition) and continue (separator)
- Separator handling: "," triggers element save and continue
- Stop condition: "]" triggers final save and return
- Same structural pattern as parse_string
Structure:
loop(cond) {
    // ... body statements (ch computation)
    if stop_cond { // ']' for array
        // ... save final element
        return result
    }
    if separator_cond { // ',' for array
        // ... save element, reset accumulator
        carrier = carrier + step
        continue
    }
    // ... accumulate element
    carrier = carrier + step
}
Implementation Summary
Key Discovery: Shared Pattern with parse_string
No new recognizer needed! The existing detect_parse_string_pattern() already handles both patterns:
- Both have return statement (stop condition)
- Both have continue statement (separator/escape)
- Both have carrier updates
- Only semantic difference is what the conditions check for
Changes Made
- Documentation Updates (~150 lines)
  - Updated ast_feature_extractor.rs to document parse_array support
  - Updated pattern_recognizer.rs wrapper documentation
  - Updated canonicalizer.rs supported patterns list
  - Added parse_array example to pattern documentation
- Unit Test (~165 lines)
  - Added test_parse_array_pattern_recognized() in canonicalizer.rs
  - Mirrors parse_string test structure with array-specific conditions
  - Verifies same Pattern4Continue routing
- Error Messages (~5 lines)
  - Updated error messages to mention parse_array
Total lines modified: ~320 lines (mostly documentation)
Acceptance Criteria
- ✅ Canonicalizer creates Skeleton for parse_array loop
- ✅ RoutingDecision.chosen == Pattern4Continue
- ✅ Strict parity green (canonicalizer and router agree)
- ✅ Default behavior unchanged
- ✅ Unit test added and passing
- ✅ No new capability needed
Results
Parity Verification
NYASH_JOINIR_DEV=1 HAKO_JOINIR_STRICT=1 ./target/release/hakorune \
tools/selfhost/test_pattern4_parse_array.hako
Output:
[loop_canonicalizer] Skeleton steps: 3
[loop_canonicalizer] Carriers: 1
[loop_canonicalizer] Has exits: true
[loop_canonicalizer] Decision: SUCCESS
[loop_canonicalizer] Chosen pattern: Pattern4Continue
[loop_canonicalizer] Missing caps: []
[loop_canonicalizer/PARITY] OK in function 'main': canonical and actual agree on Pattern4Continue
Status: ✅ Green parity - canonicalizer and router agree on Pattern4Continue
Unit Test Results
cargo test --release --lib loop_canonicalizer::canonicalizer::tests::test_parse_array_pattern_recognized
Status: ✅ PASS
Statistics
| Metric | Count |
|---|---|
| New patterns supported | 1 (parse_array, shares recognizer with parse_string) |
| Total patterns supported | 5 (skip_whitespace, parse_number, continue, parse_string, parse_array) |
| New Capability Tags | 0 (uses existing ConstStep) |
| Lines added | ~320 (mostly documentation) |
| Files modified | 3 (canonicalizer.rs, ast_feature_extractor.rs, pattern_recognizer.rs) |
| Unit tests added | 1 |
| Parity status | Green ✅ |
Comparison: Parse String vs Parse Array
| Aspect | Parse String | Parse Array |
|---|---|---|
| Stop condition | " (quote) | ] (array end) |
| Separator | \ (escape) | , (element separator) |
| Structure | continue + return | continue + return |
| Recognizer | detect_parse_string_pattern() | Same recognizer! |
| Routing | Pattern4Continue | Pattern4Continue |
| ExitContract | has_continue=true, has_return=true | has_continue=true, has_return=true |
Key Insight: Structural vs Semantic Patterns
Major Discovery: parse_string and parse_array are structurally identical at the AST level:
- Both have if stop_cond { return }
- Both have if separator_cond { continue }
- Both have carrier updates
The semantic difference (what the conditions check) doesn't matter for pattern recognition!
This demonstrates the power of AST-based pattern matching: we can recognize structural patterns without understanding their semantic meaning.
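The point about structure versus semantics can be illustrated with a feature extractor that never looks at condition text. Everything here is a hypothetical sketch: `Stmt`, `ExitFeatures`, and `dual_exit_loop` are illustrative names, and the real recognizer extracts more than these three flags.

```rust
/// Toy AST (hypothetical). `cond` holds the condition text but is never inspected.
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Stmt {
    Other,
    CarrierUpdate,
    Return,
    Continue,
    If { cond: String, body: Vec<Stmt> },
}

#[derive(Debug, Default, PartialEq, Eq)]
struct ExitFeatures {
    has_return: bool,
    has_continue: bool,
    has_carrier_update: bool,
}

/// Walks the body recursively and records which statement kinds occur.
fn collect_features(body: &[Stmt], out: &mut ExitFeatures) {
    for stmt in body {
        match stmt {
            Stmt::Return => out.has_return = true,
            Stmt::Continue => out.has_continue = true,
            Stmt::CarrierUpdate => out.has_carrier_update = true,
            Stmt::If { body, .. } => collect_features(body, out),
            Stmt::Other => {}
        }
    }
}

fn exit_features(body: &[Stmt]) -> ExitFeatures {
    let mut out = ExitFeatures::default();
    collect_features(body, &mut out);
    out
}

/// Builds the shared loop shape; only the condition text differs between callers.
fn dual_exit_loop(stop_cond: &str, separator_cond: &str) -> Vec<Stmt> {
    vec![
        Stmt::Other,
        Stmt::If { cond: stop_cond.to_string(), body: vec![Stmt::Return] },
        Stmt::If {
            cond: separator_cond.to_string(),
            body: vec![Stmt::CarrierUpdate, Stmt::Continue],
        },
        Stmt::Other,
        Stmt::CarrierUpdate,
    ]
}
```

Calling `exit_features` on `dual_exit_loop("ch == quote", "ch == escape")` and on `dual_exit_loop("ch == ']'", "ch == ','")` gives equal results, because the condition strings never enter the computation.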
Follow-up Opportunities
Next Steps (Phase 143 P3)
- Support parse_object pattern (likely also shares the same recognizer!)
- Document pattern families (structural equivalence classes)
Future Enhancements
- Generalize to "dual-exit patterns" (continue + return)
- Add corpus analysis to discover more structural equivalences
- Create pattern taxonomy based on AST structure
Lessons Learned
- Structural Equivalence: Different semantic patterns can share the same AST structure
- Recognizer Reuse: One recognizer can handle multiple use cases
- Documentation > Code: More documentation changes than code changes
- Test Coverage: Unit tests verify both semantic variants work with the same recognizer
P3: parse_object Pattern - Key-Value Pair Collection
Status
✅ Complete (2025-12-16)
Objective
Verify that parse_object pattern (key-value pair collection) is recognized by the existing recognizer, maintaining structural equivalence with parse_string/parse_array.
Target Pattern
tools/selfhost/test_pattern4_parse_object.hako
loop(p < s.length()) {
    // ... optional body statements
    // Check for object end (return)
    local ch = s.substring(p, p+1)
    if ch == "}" {
        return obj // Stop: object complete
    }
    // Check for separator (continue)
    if ch == "," {
        p = p + 1
        continue // Separator: continue to next key-value pair
    }
    // Regular processing
    p = p + 1
}
Pattern Characteristics
Key Features:
- Multiple exit types: both return (stop condition) and continue (separator)
- Separator handling: "," triggers continue to next pair
- Stop condition: "}" triggers return with result
- Same structural pattern as parse_string/parse_array
Structure:
loop(cond) {
    // ... body statements (ch computation)
    if stop_cond { // '}' for object
        return result
    }
    if separator_cond { // ',' for object
        carrier = carrier + step
        continue
    }
    // ... regular processing
    carrier = carrier + step
}
Implementation Summary
Key Discovery: Complete Structural Equivalence
No new recognizer needed! The existing detect_parse_string_pattern() handles parse_object perfectly:
- Has return statement (stop condition: "}")
- Has continue statement (separator: ",")
- Has carrier updates (p = p + 1)
- Only semantic difference is the stop/separator characters
Pattern Family Confirmed: parse_string, parse_array, and parse_object are structurally identical.
Changes Made
- Test File Creation (~50 lines)
  - Created tools/selfhost/test_pattern4_parse_object.hako
  - Minimal test demonstrating parse_object loop structure
- Unit Test (~170 lines)
  - Added test_parse_object_pattern_recognized() in canonicalizer.rs
  - Mirrors parse_array test structure with object-specific conditions ("}" and ",")
  - Verifies same Pattern4Continue routing
- Documentation (this section)
Total implementation: ~220 lines (no new recognizer code needed!)
Acceptance Criteria
- ✅ Canonicalizer creates Skeleton for parse_object loop
- ✅ RoutingDecision.chosen == Pattern4Continue
- ✅ RoutingDecision.missing_caps == []
- ✅ Strict parity green (canonicalizer and router agree)
- ✅ Default behavior unchanged
- ✅ Unit test added and passing
- ✅ No new capability needed
- ✅ No new recognizer needed (existing recognizer handles it)
Results
Parity Verification
NYASH_JOINIR_DEV=1 HAKO_JOINIR_STRICT=1 ./target/release/hakorune \
tools/selfhost/test_pattern4_parse_object.hako
Output:
[loop_canonicalizer] Chosen pattern: Pattern4Continue
[choose_pattern_kind/PARITY] OK: canonical and actual agree on Pattern4Continue
[loop_canonicalizer/PARITY] OK in function 'Main.parse_object_loop/0': canonical and actual agree on Pattern4Continue
Status: ✅ Green parity - canonicalizer and router agree on Pattern4Continue
Unit Test Results
cargo test --release --lib loop_canonicalizer::canonicalizer::tests::test_parse_object_pattern_recognized
Status: ✅ PASS
Statistics
| Metric | Count |
|---|---|
| New patterns supported | 1 (parse_object, shares recognizer with parse_string/array) |
| Total patterns supported | 6 (skip_whitespace, parse_number, continue, parse_string, parse_array, parse_object) |
| New Capability Tags | 0 (uses existing ConstStep) |
| Lines added | ~220 (test file + unit test + docs) |
| Files modified | 2 (canonicalizer.rs, new test file) |
| Unit tests added | 1 |
| Parity status | Green ✅ |
| New recognizer code | 0 lines (complete reuse!) |
Comparison: Parse String vs Parse Array vs Parse Object
| Aspect | Parse String | Parse Array | Parse Object |
|---|---|---|---|
| Stop condition | " (quote) | ] (array end) | } (object end) |
| Separator | \ (escape) | , (element separator) | , (pair separator) |
| Structure | continue + return | continue + return | continue + return |
| Recognizer | detect_parse_string_pattern() | Same | Same |
| Routing | Pattern4Continue | Pattern4Continue | Pattern4Continue |
| ExitContract | has_continue=true, has_return=true | Same | Same |
Key Insight: Structural Pattern Family
Major Discovery: parse_string, parse_array, and parse_object form a structural pattern family:
- All have if stop_cond { return }
- All have if separator_cond { continue }
- All have carrier updates
- One recognizer handles all three!
The semantic differences (string quote vs array bracket vs object brace) are invisible at the AST structure level.
Implication: AST-based pattern matching creates natural pattern families. When we implement one pattern, we often get multiple variants "for free".
Coverage Expansion Summary
Phase 143 started with 2 patterns (skip_whitespace, continue) and expanded to 6 patterns:
- P0: Added parse_number (new recognizer)
- P1: Added parse_string (new recognizer)
- P2: Added parse_array (reused parse_string recognizer)
- P3: Added parse_object (reused parse_string recognizer)
Recognizer efficiency: 2 new recognizers → 4 new patterns supported!
Follow-up Opportunities
Next Steps (Phase 144+)
- Document pattern families in design docs
- Add corpus analysis to discover more structural equivalences
- Create pattern taxonomy based on AST structure
- Explore other potential pattern families
Future Enhancements
- Generalize to "dual-exit patterns" (continue + return)
- Support triple-exit patterns (break + continue + return)
- Add signature-based pattern discovery
Lessons Learned
- Pattern Families: Structural equivalence creates natural groupings
- Recognizer Reuse: Testing existing recognizers before writing new ones saves effort
- Semantic vs Structural: AST patterns are structural; semantic meaning doesn't affect recognition
- Test-Driven Discovery: Unit tests verify recognizer generality
- Documentation Value: Recording discoveries helps future pattern work
Phase 143 P0: Complete ✅
Phase 143 P1: Complete ✅
Phase 143 P2: Complete ✅
Phase 143 P3: Complete ✅
Date: 2025-12-16
Implemented by: Claude Code (Sonnet 4.5)