Files
hakorune/docs/blueprints/strings-utf8-byte.md
Selfhosting Dev c8063c9e41 pyvm: split op handlers into ops_core/ops_box/ops_ctrl; add ops_flow + intrinsic; delegate vm.py without behavior change
net-plugin: modularize constants (consts.rs) and sockets (sockets.rs); remove legacy commented socket code; fix unused imports
mir: move instruction unit tests to tests/mir_instruction_unit.rs (file lean-up); no semantic changes
runner/pyvm: ensure using pre-strip; misc docs updates

Build: cargo build ok; legacy cfg warnings remain as before
2025-09-21 08:53:00 +09:00

62 lines
2.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Strings Blueprint — UTF8 First, Bytes Separate
Status: active (Phase Freeze compatible)
Updated: 2025-09-21
Purpose
- Unify string semantics by delegating StringBox public APIs to dedicated cursor boxes.
- Keep behavior stable while making codepoint vs byte decisions explicit and testable.
Pillars
- Utf8CursorBox (codepoint-oriented)
- length/indexOf/substring operate on UTF8 codepoints.
- Intended as the default delegate for StringBox public APIs.
- ByteCursorBox (byte-oriented)
- length/indexOf/substring operate on raw bytes.
- Use explicitly for byte-level parsing or binary protocols.
Delegation Strategy
- StringBox delegates to Utf8CursorBox for core methods: length/indexOf/substring.
- Provide conversion helpers: toUtf8Cursor(), toByteCursor() (thin wrappers).
- Prefer delegation over inheritance; keep “from” minimal to avoid API ambiguity.
API Semantics
- indexOf: define two flavors via the box boundary.
- StringBox.indexOf → Utf8CursorBox.indexOf (CP-based; canonical)
- ByteCursorBox.indexOf → byte-based; optin only
- substring: follow the same split (CP vs Byte). Do not mix semantics.
- Document preconditions for indices (outofrange clamped/errored per guide).
Implementation Plan (staged, nonbreaking)
1) Provide MVP cursor boxes (done)
- apps/libs/utf8_cursor.nyash
- apps/libs/byte_cursor.nyash
2) Delegate StringBox public methods to Utf8CursorBox (internal only; behavior unchanged)
- Start with length → indexOf → substring
- Add targeted smokes for edge cases (multibyte CP, boundaries)
3) Replace adhoc scans in Nyash scripts with cursor usage (MiniVM/macros)
- Migrate internal scanners (no external behavior change)
4) Introduce ByteCursorBox only where bytelevel semantics are required
- Keep call sites explicit to avoid ambiguity
Transition Gate (Rust dev only)
- Env `NYASH_STR_CP=1` enables CP semantics for legacy byte-based paths in Rust runtime (e.g., StringBox.length/indexOf/lastIndexOf).
- Default remains byte in Rust during freeze; PyVM follows CP semantics. CI smokes validate CP behavior via PyVM.
Related Docs
- reference/language/strings.md — policy & scope
- guides/language-core-and-sugar.md — core minimal + sugar
- reference/language/EBNF.md — operators (! adopted; dowhile not adopted)
- guides/loopform.md — loop normalization policy
Box Foundations (string-related)
- Utf8CursorBox, ByteCursorBox
- StringExtBox (trim/startsWith/endsWith/replace/split)
- StringBuilderBox (append/toString)
- JsonCursorBox (lightweight JSON scanning helpers)
Testing Notes
- Keep PyVM as the reference execution path.
- Add smokes: CP boundaries, mixed ASCII/nonASCII, indexOf not found, substring slices.
- Avoid perf work; focus on semantics + observability.