Nyash Strings: UTF‑8 First, Bytes Separate
Status: Design committed. This document defines how Nyash treats text vs bytes and the minimal APIs we expose in each layer.
Principles
- UTF‑8 is the only in‑memory encoding for StringBox.
- Text operations are defined in terms of Unicode code points (CP). Grapheme cluster (GC) helpers may be added on top.
- Bytes are not text. Byte operations live in a separate ByteCursorBox and byte‑level instructions.
- Conversions are explicit.
Model
- StringBox: immutable UTF‑8 string value. Public text APIs are CP‑indexed.
- Utf8CursorBox: delegated implementation for scanning and slicing StringBox as CPs.
- ByteCursorBox: independent binary view/holder for byte sequences.
Invariants
- Indices are zero‑based. Slices use half‑open intervals [i, j).
- CP APIs never intermix with byte APIs. GC APIs are explicitly suffixed (e.g., *_gc).
- Conversions must be explicit. No implicit transcoding.
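
To make the CP/byte separation concrete, here is a small illustration in Python (not Nyash). It relies only on the fact that Python's `str` is indexed by code point and `bytes` by byte, which happens to mirror the split described above; the variable names are made up for this example.

```python
# Seven code points: n, a, U+00EF, v, e, U+65E5, U+672C.
text = "na\u00efve\u65e5\u672c"

# The conversion to bytes is explicit, never implicit.
data = text.encode("utf-8")

cp_len = len(text)      # counted in code points
byte_len = len(data)    # counted in bytes

# Both domains use zero-based, half-open [i, j) ranges,
# but the same numbers select different content.
cp_slice = text[2:5]    # code points 2, 3, 4
byte_slice = data[2:5]  # bytes 2, 3, 4
```

The same `[2, 5)` range covers three code points in the text domain but only two (one of them a 2-byte sequence) in the byte domain, which is why the two index spaces must never be mixed.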
Core APIs (MVP)
Text (UTF‑8/CP): implemented by StringBox delegating to Utf8CursorBox.
- length() -> i64: number of code points.
- substring(i,j) -> StringBox: CP slice.
- indexOf(substr, from=0) -> i64: CP index or -1.
- Optional helpers: startsWith/endsWith/replace/split/trim as sugar.
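
As a rough sketch of how CP‑indexed operations could be layered over a UTF‑8 buffer, the Python below counts lead bytes to map code point indices to byte offsets. It is not the Utf8CursorBox implementation: the function names are hypothetical, the input is assumed to be valid UTF‑8, and the linear scans are chosen for clarity rather than speed.

```python
def is_continuation(b: int) -> bool:
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (b & 0xC0) == 0x80

def cp_length(buf: bytes) -> int:
    # Every code point contributes exactly one non-continuation byte.
    return sum(1 for b in buf if not is_continuation(b))

def cp_to_byte_offset(buf: bytes, cp_index: int) -> int:
    # Byte offset where code point number cp_index starts.
    # Indices below 0 map to 0; indices past the end map to len(buf).
    if cp_index <= 0:
        return 0
    seen = 0
    for offset, b in enumerate(buf):
        if not is_continuation(b):
            if seen == cp_index:
                return offset
            seen += 1
    return len(buf)

def cp_substring(buf: bytes, i: int, j: int) -> bytes:
    # Half-open CP slice [i, j), returned as UTF-8 bytes.
    return buf[cp_to_byte_offset(buf, i):cp_to_byte_offset(buf, j)]

def cp_index_of(buf: bytes, needle: bytes, start_cp: int = 0) -> int:
    # A valid UTF-8 needle can only match at a code point boundary,
    # so a plain byte search is enough; the hit is converted back to a CP index.
    pos = buf.find(needle, cp_to_byte_offset(buf, start_cp))
    return -1 if pos < 0 else cp_length(buf[:pos])

sample = "a\u00e9\u65e5b".encode("utf-8")  # 4 code points, 7 bytes
```

With `sample` as above, `cp_length(sample)` is 4 while `len(sample)` is 7, and `cp_index_of(sample, b"b")` reports CP index 3 rather than byte offset 6.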
Bytes: handled by ByteCursorBox.
- len_bytes() -> i64
- slice_bytes(i,j) -> ByteCursorBox
- find_bytes(pattern, from=0) -> i64
- to_string_utf8(strict=true) -> StringBox | Error: strict throws on invalid UTF‑8 (MVP may replace with U+FFFD when strict=false).
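
The byte‑side surface can be mocked up in a few lines of Python. This is only a sketch of the listed methods, not ByteCursorBox itself: the class name `ByteCursor` is invented for the example, indices are assumed to be non‑negative, and Python's built‑in `errors="strict"` / `errors="replace"` decoding modes stand in for the strict and U+FFFD behaviors.

```python
class ByteCursor:
    """Hypothetical stand-in for the byte-level API described above."""

    def __init__(self, data: bytes):
        self._data = bytes(data)

    def len_bytes(self) -> int:
        return len(self._data)

    def slice_bytes(self, i: int, j: int) -> "ByteCursor":
        # Half-open byte range [i, j); assumes 0 <= i and 0 <= j.
        return ByteCursor(self._data[i:j])

    def find_bytes(self, pattern: bytes, start: int = 0) -> int:
        # Byte offset of the first match at or after start, or -1.
        return self._data.find(pattern, start)

    def to_string_utf8(self, strict: bool = True) -> str:
        # strict=True raises UnicodeDecodeError on invalid UTF-8;
        # strict=False substitutes U+FFFD for invalid sequences.
        return self._data.decode("utf-8", errors="strict" if strict else "replace")

broken = ByteCursor(b"ok\xff")  # 0xFF can never appear in valid UTF-8
lossy = broken.to_string_utf8(strict=False)
try:
    broken.to_string_utf8()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Note that the only way from bytes to text here is the explicit `to_string_utf8` call, in line with the rule that conversions are never implicit.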
Errors
- CP APIs clamp out‑of‑range indices (dev builds may enable strict). Byte APIs mirror the same behavior for byte indices.
- to_string_utf8(strict=true) fails on invalid input; strict=false replaces invalid sequences by U+FFFD.
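
One possible reading of the clamping rule, sketched in Python. The helper names are hypothetical, and two details are assumptions rather than anything this document specifies: a reversed range (j < i) is treated as empty, and the dev‑build strict mode is modeled as raising an error.

```python
from typing import Tuple

def clamp_range(i: int, j: int, length: int, strict: bool = False) -> Tuple[int, int]:
    # Assumed strict behavior: reject anything outside 0 <= i <= j <= length.
    if strict and not (0 <= i <= j <= length):
        raise IndexError(f"range [{i}, {j}) is outside [0, {length}]")
    lo = min(max(i, 0), length)   # pull the start into [0, length]
    hi = min(max(j, lo), length)  # pull the end into [lo, length]
    return lo, hi

def substring_clamped(text: str, i: int, j: int) -> str:
    # CP-indexed slice that never fails on out-of-range indices.
    lo, hi = clamp_range(i, j, len(text))
    return text[lo:hi]
```

Under these assumptions, `substring_clamped("h\u00e9llo", -2, 3)` yields the first three code points and `substring_clamped("h\u00e9llo", 3, 99)` yields the last two; the same `clamp_range` could serve byte indices unchanged.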
Interop
- FFI/ABI boundaries use UTF‑8. Non‑UTF‑8 sources must enter via ByteCursorBox + explicit transcoding.
- Other encodings (e.g., UTF‑16) are future work via separate cursor boxes; StringBox remains UTF‑8.
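
The "bytes in, explicit transcoding, UTF‑8 out" path might look like the following in Python. This is purely illustrative: `transcode_to_utf8` is a hypothetical helper, and the use of Python codec names such as `"utf-16-le"` is an assumption of the example, since non‑UTF‑8 cursor boxes are future work.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    # Step 1: the caller names the source encoding; nothing is guessed.
    text = raw.decode(source_encoding)
    # Step 2: re-encode as UTF-8, the only encoding allowed past this point.
    return text.encode("utf-8")

utf16_input = "h\u00e9".encode("utf-16-le")  # 4 bytes in UTF-16LE
utf8_output = transcode_to_utf8(utf16_input, "utf-16-le")
```

The foreign bytes never pose as text: they stay in the byte domain until the caller states what they are.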
Roadmap
- Provide Nyash‑level MVP boxes: Utf8CursorBox, ByteCursorBox.
- Route StringBox public methods through Utf8CursorBox.
- Migrate Mini‑VM and macro scanners to use Utf8CursorBox helpers.
- Add CP/byte parity smokes; later add GC helpers and normalizers.
Proposed Convenience (design only)
Parsing helpers (sugar; documented during feature‑pause, not implemented):
- toDigitOrNull(base=10) -> i64 | null
  - Returns 0..9 when the code point is a decimal digit (or base subset), otherwise null.
  - CP based; delegates to Utf8CursorBox to read the leading code point.
- toIntOrNull() -> i64 | null
  - Parses the leading consecutive decimal digits into an integer; returns null when no digit at head.
  - Pure function; does not move any external cursor (callers decide how to advance).
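
A minimal Python sketch of the two proposed helpers, using only comparisons and simple arithmetic. Several points are assumptions of this sketch, not part of the design: only ASCII digits 0..9 are recognized, bases above 10 are rejected, `None` plays the role of null, and i64 overflow is ignored because Python integers are unbounded.

```python
from typing import Optional

def to_digit_or_null(s: str, base: int = 10) -> Optional[int]:
    # Look only at the leading code point; the input is never consumed.
    if not s or not (1 <= base <= 10):
        return None
    ch = s[0]
    if not ("0" <= ch <= "9"):
        return None
    value = ord(ch) - ord("0")
    # "Base subset": for base 8, the digits 8 and 9 are not accepted.
    return value if value < base else None

def to_int_or_null(s: str) -> Optional[int]:
    # Accumulate leading decimal digits; stop at the first non-digit.
    total: Optional[int] = None
    for ch in s:
        digit = to_digit_or_null(ch)
        if digit is None:
            break
        total = digit if total is None else total * 10 + digit
    return total
```

For instance, `to_int_or_null("123abc")` gives 123 and `to_int_or_null("abc")` gives `None`. Because neither function reports how many digits it read, a caller that needs to advance a cursor has to count them separately.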
Notes
- Zero new runtime opcodes; compiled as comparisons and simple arithmetic.
- Option/Maybe may replace null in a future revision; documenting null keeps MVP simple.