hakorune/docs/reference/language/strings.md
2025-09-21 08:53:00 +09:00

# Nyash Strings: UTF-8 First, Bytes Separate
Status: Design committed. This document defines how Nyash treats text vs bytes and the minimal APIs we expose in each layer.
## Principles
- UTF-8 is the only in-memory encoding for `StringBox`.
- Text operations are defined in terms of Unicode code points (CP). Grapheme cluster (GC) helpers may be added on top.
- Bytes are not text. Byte operations live in a separate `ByteCursorBox` and byte-level instructions.
- Conversions are explicit.
## Model
- `StringBox`: immutable UTF-8 string value. Public text APIs are CP-indexed.
- `Utf8CursorBox`: delegated implementation for scanning and slicing `StringBox` as CPs.
- `ByteCursorBox`: independent binary view/holder for byte sequences.
## Invariants
- Indices are zero-based. Slices use half-open intervals `[i, j)`.
- CP APIs never intermix with byte APIs. GC APIs are explicitly suffixed (e.g., `*_gc`).
- Conversions must be explicit. No implicit transcoding.
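
The difference between CP indexing and byte indexing can be illustrated outside Nyash. The following is a minimal sketch in Python, chosen only because Python's `str` is also indexed by code point; the sample string and variable names are illustrative assumptions and are not part of the Nyash API.

```python
# Illustration only: Python's str is CP-indexed, so it mirrors the CP view
# described above; the encoded bytes object mirrors the byte view.
text = "héllo, 世界"

cp_count = len(text)                     # number of code points
byte_count = len(text.encode("utf-8"))   # number of UTF-8 bytes

# Half-open slices [i, j): the element at index j is excluded, so adjacent
# slices join back together without overlap or gaps.
head = text[0:5]
tail = text[5:cp_count]

print(cp_count, byte_count, head)
```

Because `é` takes two bytes and each CJK character takes three in UTF-8, the two counts differ, which is why the CP and byte APIs are kept apart.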
## Core APIs (MVP)
Text (UTF-8/CP): implemented by `StringBox` delegating to `Utf8CursorBox`.
- `length() -> i64` — number of code points.
- `substring(i,j) -> StringBox` — CP slice.
- `indexOf(substr, from=0) -> i64` — CP index or `-1`.
- Optional helpers: `startsWith/endsWith/replace/split/trim` as sugar.
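
One way a cursor could provide CP-indexed operations on top of UTF-8 storage is sketched below in Python. This is an assumption about the approach, not the actual `Utf8CursorBox` implementation: the helper `cp_offsets` is hypothetical, the input is assumed to be valid UTF-8, and indices are assumed to be in range.

```python
def cp_offsets(data):
    """Byte offset of every code point start, plus the end offset.

    Hypothetical helper. In valid UTF-8, continuation bytes have the
    bit pattern 10xxxxxx; every other byte starts a code point.
    """
    starts = [k for k, b in enumerate(data) if (b & 0xC0) != 0x80]
    return starts + [len(data)]


def length(data):
    """Number of code points in UTF-8 encoded data."""
    return len(cp_offsets(data)) - 1


def substring(data, i, j):
    """CP slice [i, j), returned as UTF-8 bytes."""
    offs = cp_offsets(data)
    return data[offs[i]:offs[j]]


def index_of(data, sub, start=0):
    """CP index of the first match at or after CP index start, or -1."""
    offs = cp_offsets(data)
    pos = data.find(sub, offs[start])   # byte-level search
    if pos < 0:
        return -1
    # UTF-8 is self-synchronizing, so a match of a valid UTF-8 needle
    # always begins on a code point boundary and is present in offs.
    return offs.index(pos)


s = "aé世b".encode("utf-8")
print(length(s), substring(s, 1, 3).decode("utf-8"), index_of(s, "世".encode("utf-8")))
```

The sketch rescans the string on every call for clarity; a real cursor would more likely cache offsets or scan incrementally.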
Bytes: handled by `ByteCursorBox`.
- `len_bytes() -> i64`
- `slice_bytes(i,j) -> ByteCursorBox`
- `find_bytes(pattern, from=0) -> i64`
- `to_string_utf8(strict=true) -> StringBox | Error`: with `strict=true`, fails on invalid UTF-8; with `strict=false`, the MVP may replace invalid sequences with U+FFFD.
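
The byte-side surface can be sketched in Python as follows. The class name `ByteCursor` and its internals are assumptions made for illustration; only the method names and the strict/non-strict distinction come from the list above. In this sketch the strict failure surfaces as Python's `UnicodeDecodeError`, standing in for the `Error` result.

```python
class ByteCursor:
    """Illustrative stand-in for ByteCursorBox (not the real implementation)."""

    def __init__(self, data):
        self._data = bytes(data)

    def len_bytes(self):
        return len(self._data)

    def slice_bytes(self, i, j):
        # Byte slice [i, j); the result is another byte holder, never text.
        return ByteCursor(self._data[i:j])

    def find_bytes(self, pattern, start=0):
        # Byte index of the first match at or after start, or -1.
        return self._data.find(pattern, start)

    def to_string_utf8(self, strict=True):
        # The only path from bytes to text, and it is explicit.
        # strict=True raises on invalid UTF-8; strict=False substitutes U+FFFD.
        errors = "strict" if strict else "replace"
        return self._data.decode("utf-8", errors=errors)


bc = ByteCursor("héllo".encode("utf-8"))
cut = bc.slice_bytes(0, 2)            # splits the two-byte "é" in half
print(bc.len_bytes(), bc.find_bytes(b"llo"), repr(cut.to_string_utf8(strict=False)))
```

Slicing at an arbitrary byte index can cut a multi-byte sequence in half, which is exactly the case where the strict and non-strict conversions differ.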
## Errors
- CP APIs clamp out-of-range indices (dev builds may enable strict checking instead). Byte APIs apply the same behavior to byte indices.
- `to_string_utf8(strict=true)` fails on invalid input; `strict=false` replaces invalid sequences with U+FFFD.
## Interop
- FFI/ABI boundaries use UTF-8. Non-UTF-8 sources must enter via `ByteCursorBox` + explicit transcoding.
- Other encodings (e.g., UTF-16) are future work via separate cursor boxes; `StringBox` remains UTF-8.
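
As an illustration of what "explicit transcoding" means at a boundary, the Python sketch below converts UTF-16LE bytes into UTF-8 bytes in a visible, separate step. The function name is hypothetical and Nyash's own transcoding API is future work, so this shows the shape of the rule rather than any existing interface.

```python
def utf16le_to_utf8(raw):
    """Transcode UTF-16LE bytes to UTF-8 bytes (hypothetical helper).

    Decoding is strict by default, so malformed input raises
    UnicodeDecodeError rather than being guessed at.
    """
    text = raw.decode("utf-16-le")
    return text.encode("utf-8")


foreign = "世界".encode("utf-16-le")   # bytes arriving from a non-UTF-8 source
utf8 = utf16le_to_utf8(foreign)        # only now is it safe to treat as text
print(len(foreign), len(utf8))
```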
## Roadmap
1) Provide Nyash-level MVP boxes: `Utf8CursorBox`, `ByteCursorBox`.
2) Route `StringBox` public methods through `Utf8CursorBox`.
3) Migrate MiniVM and macro scanners to use `Utf8CursorBox` helpers.
4) Add CP/byte parity smokes; later add GC helpers and normalizers.
## Proposed Convenience (design only)
Parsing helpers (sugar; freeze-era design, not implemented):
- `toDigitOrNull(base=10) -> i64 | null`
  - Returns the digit value 0..9 when the leading code point is a decimal digit (or in the subset allowed by `base`); otherwise returns `null`.
  - CP-based; delegates to `Utf8CursorBox` to read the leading code point.
- `toIntOrNull() -> i64 | null`
  - Parses the leading run of consecutive decimal digits into an integer; returns `null` when the string does not start with a digit.
  - Pure function; does not move any external cursor (callers decide how to advance).
Notes
- Zero new runtime opcodes; compiled as comparisons and simple arithmetic.
- `Option/Maybe` may replace `null` in a future revision; documenting `null` keeps MVP simple.
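
The intended behavior of the two helpers can be sketched in Python, with `None` standing in for `null`. This is a reading of the design notes above, not an implementation: it assumes only the ASCII digits `0`..`9` count as decimal digits and that `base` is at most 10, and it ignores `i64` overflow because Python integers are unbounded.

```python
def to_digit_or_null(s, base=10):
    """Digit value of the leading code point, or None."""
    if not s:
        return None
    d = ord(s[0]) - ord("0")            # one subtraction, then comparisons
    return d if 0 <= d < min(base, 10) else None


def to_int_or_null(s):
    """Integer value of the leading run of decimal digits, or None."""
    value = None
    for ch in s:
        d = to_digit_or_null(ch)
        if d is None:
            break                        # stop at the first non-digit
        value = (0 if value is None else value) * 10 + d
    return value


print(to_digit_or_null("7x"), to_int_or_null("123abc"), to_int_or_null("abc"))
```

Both functions only read their argument, so the caller decides how far to advance its own cursor after a successful parse.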