tomoaki/hakorune

Files

Selfhosting Dev 8e4cadd349 selfhost(pyvm): MiniVmPrints – prefer JSON route early-return (ok==1) to avoid fallback loops; keep default behavior unchanged elsewhere

2025-09-22 07:54:25 +09:00

3.0 KiB

Raw Blame History

Nyash Strings: UTF‑8 First, Bytes Separate

Status: Design committed. This document defines how Nyash treats text vs bytes and the minimal APIs we expose in each layer.

Principles

UTF‑8 is the only in‑memory encoding for StringBox.
Text operations are defined in terms of Unicode code points (CP). Grapheme cluster (GC) helpers may be added on top.
Bytes are not text. Byte operations live in a separate ByteCursorBox and byte‑level instructions.
Conversions are explicit.

Model

StringBox: immutable UTF‑8 string value. Public text APIs are CP‑indexed.
Utf8CursorBox: delegated implementation for scanning and slicing StringBox as CPs.
ByteCursorBox: independent binary view/holder for byte sequences.

Invariants

Indices are zero‑based. Slices use half‑open intervals [i, j).
CP APIs never intermix with byte APIs. GC APIs are explicitly suffixed (e.g., *_gc).
Conversions must be explicit. No implicit transcoding.

Core APIs (MVP)

Text (UTF‑8/CP): implemented by StringBox delegating to Utf8CursorBox.

length() -> i64 — number of code points.
substring(i,j) -> StringBox — CP slice.
indexOf(substr, from=0) -> i64 — CP index or -1.
Optional helpers: startsWith/endsWith/replace/split/trim as sugar.

Bytes: handled by ByteCursorBox.

len_bytes() -> i64
slice_bytes(i,j) -> ByteCursorBox
find_bytes(pattern, from=0) -> i64
to_string_utf8(strict=true) -> StringBox | Error — strict throws on invalid UTF‑8 (MVP may replace with U+FFFD when strict=false).

Errors

CP APIs clamp out‑of‑range indices (dev builds may enable strict). Byte APIs mirror the same behavior for byte indices.
to_string_utf8(strict=true) fails on invalid input; strict=false replaces invalid sequences by U+FFFD.

Interop

FFI/ABI boundaries use UTF‑8. Non‑UTF‑8 sources must enter via ByteCursorBox + explicit transcoding.
Other encodings (e.g., UTF‑16) are future work via separate cursor boxes; StringBox remains UTF‑8.

Roadmap

Provide Nyash‑level MVP boxes: Utf8CursorBox, ByteCursorBox.
Route StringBox public methods through Utf8CursorBox.
Migrate Mini‑VM and macro scanners to use Utf8CursorBox helpers.
Add CP/byte parity smokes; later add GC helpers and normalizers.

Proposed Convenience (design only)

Parsing helpers (sugar; documented during feature‑pause, not implemented):

toDigitOrNull(base=10) -> i64 | null
- Returns 0..9 when the code point is a decimal digit (or base subset), otherwise null.
- CP based; delegates to Utf8CursorBox to read the leading code point.
toIntOrNull() -> i64 | null
- Parses the leading consecutive decimal digits into an integer; returns null when no digit at head.
- Pure function; does not move any external cursor (callers decide how to advance).

Notes

Zero new runtime opcodes; compiled as comparisons and simple arithmetic.
Option/Maybe may replace null in a future revision; documenting null keeps MVP simple.