Files
hakorune/docs/reference/language/strings.md
Selfhosting Dev c8063c9e41 pyvm: split op handlers into ops_core/ops_box/ops_ctrl; add ops_flow + intrinsic; delegate vm.py without behavior change
net-plugin: modularize constants (consts.rs) and sockets (sockets.rs); remove legacy commented socket code; fix unused imports
mir: move instruction unit tests to tests/mir_instruction_unit.rs (file lean-up); no semantic changes
runner/pyvm: ensure using pre-strip; misc docs updates

Build: cargo build ok; legacy cfg warnings remain as before
2025-09-21 08:53:00 +09:00

3.0 KiB
Raw Blame History

Nyash Strings: UTF8 First, Bytes Separate

Status: Design committed. This document defines how Nyash treats text vs bytes and the minimal APIs we expose in each layer.

Principles

  • UTF8 is the only inmemory encoding for StringBox.
  • Text operations are defined in terms of Unicode code points (CP). Grapheme cluster (GC) helpers may be added on top.
  • Bytes are not text. Byte operations live in a separate ByteCursorBox and bytelevel instructions.
  • Conversions are explicit.

Model

  • StringBox: immutable UTF8 string value. Public text APIs are CPindexed.
  • Utf8CursorBox: delegated implementation for scanning and slicing StringBox as CPs.
  • ByteCursorBox: independent binary view/holder for byte sequences.

Invariants

  • Indices are zerobased. Slices use halfopen intervals [i, j).
  • CP APIs never intermix with byte APIs. GC APIs are explicitly suffixed (e.g., *_gc).
  • Conversions must be explicit. No implicit transcoding.

Core APIs (MVP)

Text (UTF8/CP): implemented by StringBox delegating to Utf8CursorBox.

  • length() -> i64 — number of code points.
  • substring(i,j) -> StringBox — CP slice.
  • indexOf(substr, from=0) -> i64 — CP index or -1.
  • Optional helpers: startsWith/endsWith/replace/split/trim as sugar.

Bytes: handled by ByteCursorBox.

  • len_bytes() -> i64
  • slice_bytes(i,j) -> ByteCursorBox
  • find_bytes(pattern, from=0) -> i64
  • to_string_utf8(strict=true) -> StringBox | Error — strict throws on invalid UTF8 (MVP may replace with U+FFFD when strict=false).

Errors

  • CP APIs clamp outofrange indices (dev builds may enable strict). Byte APIs mirror the same behavior for byte indices.
  • to_string_utf8(strict=true) fails on invalid input; strict=false replaces invalid sequences by U+FFFD.

Interop

  • FFI/ABI boundaries use UTF8. NonUTF8 sources must enter via ByteCursorBox + explicit transcoding.
  • Other encodings (e.g., UTF16) are future work via separate cursor boxes; StringBox remains UTF8.

Roadmap

  1. Provide Nyashlevel MVP boxes: Utf8CursorBox, ByteCursorBox.
  2. Route StringBox public methods through Utf8CursorBox.
  3. Migrate MiniVM and macro scanners to use Utf8CursorBox helpers.
  4. Add CP/byte parity smokes; later add GC helpers and normalizers.

Proposed Convenience (design only)

Parsing helpers (sugar; freeze-era design, not implemented):

  • toDigitOrNull(base=10) -> i64 | null
    • Returns 0..9 when the code point is a decimal digit (or base subset), otherwise null.
    • CP based; delegates to Utf8CursorBox to read the leading code point.
  • toIntOrNull() -> i64 | null
    • Parses the leading consecutive decimal digits into an integer; returns null when no digit at head.
    • Pure function; does not move any external cursor (callers decide how to advance).

Notes

  • Zero new runtime opcodes; compiled as comparisons and simple arithmetic.
  • Option/Maybe may replace null in a future revision; documenting null keeps MVP simple.