2025-09-26 00:27:02 +09:00
|
|
|
|
# Strings Blueprint — UTF‑8 First, Bytes Separate
|
|
|
|
|
|
|
|
|
|
|
|
Status: active (Feature‑Pause compatible)
|
|
|
|
|
|
Updated: 2025-09-21
|
|
|
|
|
|
|
|
|
|
|
|
Purpose
|
|
|
|
|
|
- Unify string semantics by delegating StringBox public APIs to dedicated cursor boxes.
|
|
|
|
|
|
- Keep behavior stable while making codepoint vs byte decisions explicit and testable.
|
|
|
|
|
|
|
|
|
|
|
|
Pillars
|
|
|
|
|
|
- Utf8CursorBox (codepoint-oriented)
|
|
|
|
|
|
- length/indexOf/substring operate on UTF‑8 codepoints.
|
|
|
|
|
|
- Intended as the default delegate for StringBox public APIs.
|
|
|
|
|
|
- ByteCursorBox (byte-oriented)
|
|
|
|
|
|
- length/indexOf/substring operate on raw bytes.
|
|
|
|
|
|
- Use explicitly for byte-level parsing or binary protocols.
|
|
|
|
|
|
|
|
|
|
|
|
Delegation Strategy
|
|
|
|
|
|
- StringBox delegates to Utf8CursorBox for core methods: length/indexOf/substring.
|
|
|
|
|
|
- Provide conversion helpers: toUtf8Cursor(), toByteCursor() (thin wrappers).
|
|
|
|
|
|
- Prefer delegation over inheritance; keep “from” minimal to avoid API ambiguity.
|
|
|
|
|
|
|
|
|
|
|
|
API Semantics
|
|
|
|
|
|
- indexOf: define two flavors via the box boundary.
|
|
|
|
|
|
- StringBox.indexOf → Utf8CursorBox.indexOf (CP-based; canonical)
|
|
|
|
|
|
- ByteCursorBox.indexOf → byte-based; opt‑in only
|
|
|
|
|
|
- substring: follow the same split (CP vs Byte). Do not mix semantics.
|
|
|
|
|
|
- Document preconditions for indices (out‑of‑range clamped/errored per guide).
|
|
|
|
|
|
|
|
|
|
|
|
Implementation Plan (staged, non‑breaking)
|
|
|
|
|
|
1) Provide MVP cursor boxes (done)
|
2025-11-06 15:41:52 +09:00
|
|
|
|
- apps/libs/utf8_cursor.hako
|
|
|
|
|
|
- apps/libs/byte_cursor.hako
|
2025-09-26 00:27:02 +09:00
|
|
|
|
2) Delegate StringBox public methods to Utf8CursorBox (internal only; behavior unchanged)
|
|
|
|
|
|
- Start with length → indexOf → substring
|
|
|
|
|
|
- Add targeted smokes for edge cases (multi‑byte CP, boundaries)
|
|
|
|
|
|
3) Replace ad‑hoc scans in Nyash scripts with cursor usage (Mini‑VM/macros)
|
|
|
|
|
|
- Migrate internal scanners (no external behavior change)
|
|
|
|
|
|
4) Introduce ByteCursorBox only where byte‑level semantics are required
|
|
|
|
|
|
- Keep call sites explicit to avoid ambiguity
|
|
|
|
|
|
|
|
|
|
|
|
Transition Gate (Rust dev only)
|
|
|
|
|
|
- Env `NYASH_STR_CP=1` enables CP semantics for legacy byte-based paths in Rust runtime (e.g., StringBox.length/indexOf/lastIndexOf).
|
|
|
|
|
|
- Default remains byte in Rust during the feature‑pause; PyVM follows CP semantics. CI smokes validate CP behavior via PyVM.
|
|
|
|
|
|
|
|
|
|
|
|
Related Docs
|
|
|
|
|
|
- reference/language/strings.md — policy & scope
|
|
|
|
|
|
- guides/language-core-and-sugar.md — core minimal + sugar
|
|
|
|
|
|
- reference/language/EBNF.md — operators (! adopted; do‑while not adopted)
|
|
|
|
|
|
- guides/loopform.md — loop normalization policy
|
|
|
|
|
|
|
|
|
|
|
|
Box Foundations (string-related)
|
|
|
|
|
|
- Utf8CursorBox, ByteCursorBox
|
|
|
|
|
|
- StringExtBox (trim/startsWith/endsWith/replace/split)
|
|
|
|
|
|
- StringBuilderBox (append/toString)
|
|
|
|
|
|
- JsonCursorBox (lightweight JSON scanning helpers)
|
|
|
|
|
|
|
|
|
|
|
|
Testing Notes
|
|
|
|
|
|
- Keep PyVM as the reference execution path.
|
|
|
|
|
|
- Add smokes: CP boundaries, mixed ASCII/non‑ASCII, indexOf not found, substring slices.
|
|
|
|
|
|
- Avoid perf work; focus on semantics + observability.
|