pyvm: split op handlers into ops_core/ops_box/ops_ctrl; add ops_flow + intrinsic; delegate vm.py without behavior change

net-plugin: modularize constants (consts.rs) and sockets (sockets.rs); remove legacy commented socket code; fix unused imports mir: move instruction unit tests to tests/mir_instruction_unit.rs (file lean-up); no semantic changes runner/pyvm: ensure using pre-strip; misc docs updates Build: cargo build ok; legacy cfg warnings remain as before
2025-09-21 08:53:00 +09:00
parent ee17cfd979
commit c8063c9e41
247 changed files with 10187 additions and 23124 deletions
--- a/docs/reference/language/strings.md
+++ b/docs/reference/language/strings.md
@ -0,0 +1,61 @@
+# Nyash Strings: UTF‑8 First, Bytes Separate
+
+Status: Design committed. This document defines how Nyash treats text vs bytes and the minimal APIs we expose in each layer.
+
+## Principles
+- UTF‑8 is the only in‑memory encoding for `StringBox`.
+- Text operations are defined in terms of Unicode code points (CP). Grapheme cluster (GC) helpers may be added on top.
+- Bytes are not text. Byte operations live in a separate `ByteCursorBox` and byte‑level instructions.
+- Conversions are explicit.
+
+## Model
+- `StringBox`: immutable UTF‑8 string value. Public text APIs are CP‑indexed.
+- `Utf8CursorBox`: delegated implementation for scanning and slicing `StringBox` as CPs.
+- `ByteCursorBox`: independent binary view/holder for byte sequences.
+
+## Invariants
+- Indices are zero‑based. Slices use half‑open intervals `[i, j)`.
+- CP APIs never intermix with byte APIs. GC APIs are explicitly suffixed (e.g., `*_gc`).
+- Conversions must be explicit. No implicit transcoding.
+
+## Core APIs (MVP)
+
+Text (UTF‑8/CP): implemented by `StringBox` delegating to `Utf8CursorBox`.
+- `length() -> i64` — number of code points.
+- `substring(i,j) -> StringBox` — CP slice.
+- `indexOf(substr, from=0) -> i64` — CP index or `-1`.
+- Optional helpers: `startsWith/endsWith/replace/split/trim` as sugar.
+
+Bytes: handled by `ByteCursorBox`.
+- `len_bytes() -> i64`
+- `slice_bytes(i,j) -> ByteCursorBox`
+- `find_bytes(pattern, from=0) -> i64`
+- `to_string_utf8(strict=true) -> StringBox | Error` — strict throws on invalid UTF‑8 (MVP may replace with U+FFFD when `strict=false`).
+
+## Errors
+- CP APIs clamp out‑of‑range indices (dev builds may enable strict). Byte APIs mirror the same behavior for byte indices.
+- `to_string_utf8(strict=true)` fails on invalid input; `strict=false` replaces invalid sequences by U+FFFD.
+
+## Interop
+- FFI/ABI boundaries use UTF‑8. Non‑UTF‑8 sources must enter via `ByteCursorBox` + explicit transcoding.
+- Other encodings (e.g., UTF‑16) are future work via separate cursor boxes; `StringBox` remains UTF‑8.
+
+## Roadmap
+1) Provide Nyash‑level MVP boxes: `Utf8CursorBox`, `ByteCursorBox`.
+2) Route `StringBox` public methods through `Utf8CursorBox`.
+3) Migrate Mini‑VM and macro scanners to use `Utf8CursorBox` helpers.
+4) Add CP/byte parity smokes; later add GC helpers and normalizers.
+
+## Proposed Convenience (design only)
+
+Parsing helpers (sugar; freeze-era design, not implemented):
+- `toDigitOrNull(base=10) -> i64 | null`
+  - Returns 0..9 when the code point is a decimal digit (or base subset), otherwise `null`.
+  - CP based; delegates to `Utf8CursorBox` to read the leading code point.
+- `toIntOrNull() -> i64 | null`
+  - Parses the leading consecutive decimal digits into an integer; returns `null` when no digit at head.
+  - Pure function; does not move any external cursor (callers decide how to advance).
+
+Notes
+- Zero new runtime opcodes; compiled as comparisons and simple arithmetic.
+- `Option/Maybe` may replace `null` in a future revision; documenting `null` keeps MVP simple.