Files
hakorune/docs/development/design/blueprints/strings-utf8-byte.md

62 lines
2.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Strings Blueprint — UTF8 First, Bytes Separate
Status: active (FeaturePause compatible)
Updated: 2025-09-21
Purpose
- Unify string semantics by delegating StringBox public APIs to dedicated cursor boxes.
- Keep behavior stable while making codepoint vs byte decisions explicit and testable.
Pillars
- Utf8CursorBox (codepoint-oriented)
- length/indexOf/substring operate on UTF8 codepoints.
- Intended as the default delegate for StringBox public APIs.
- ByteCursorBox (byte-oriented)
- length/indexOf/substring operate on raw bytes.
- Use explicitly for byte-level parsing or binary protocols.
Delegation Strategy
- StringBox delegates to Utf8CursorBox for core methods: length/indexOf/substring.
- Provide conversion helpers: toUtf8Cursor(), toByteCursor() (thin wrappers).
- Prefer delegation over inheritance; keep “from” minimal to avoid API ambiguity.
API Semantics
- indexOf: define two flavors via the box boundary.
- StringBox.indexOf → Utf8CursorBox.indexOf (CP-based; canonical)
- ByteCursorBox.indexOf → byte-based; optin only
- substring: follow the same split (CP vs Byte). Do not mix semantics.
- Document preconditions for indices (outofrange clamped/errored per guide).
Implementation Plan (staged, nonbreaking)
1) Provide MVP cursor boxes (done)
- apps/libs/utf8_cursor.hako
- apps/libs/byte_cursor.hako
2) Delegate StringBox public methods to Utf8CursorBox (internal only; behavior unchanged)
- Start with length → indexOf → substring
- Add targeted smokes for edge cases (multibyte CP, boundaries)
3) Replace adhoc scans in Nyash scripts with cursor usage (MiniVM/macros)
- Migrate internal scanners (no external behavior change)
4) Introduce ByteCursorBox only where bytelevel semantics are required
- Keep call sites explicit to avoid ambiguity
Transition Gate (Rust dev only)
- Env `NYASH_STR_CP=1` enables CP semantics for legacy byte-based paths in Rust runtime (e.g., StringBox.length/indexOf/lastIndexOf).
- Default remains byte in Rust during the featurepause; PyVM follows CP semantics. CI smokes validate CP behavior via PyVM.
Related Docs
- reference/language/strings.md — policy & scope
- guides/language-core-and-sugar.md — core minimal + sugar
- reference/language/EBNF.md — operators (! adopted; dowhile not adopted)
- guides/loopform.md — loop normalization policy
Box Foundations (string-related)
- Utf8CursorBox, ByteCursorBox
- StringExtBox (trim/startsWith/endsWith/replace/split)
- StringBuilderBox (append/toString)
- JsonCursorBox (lightweight JSON scanning helpers)
Testing Notes
- Keep PyVM as the reference execution path.
- Add smokes: CP boundaries, mixed ASCII/nonASCII, indexOf not found, substring slices.
- Avoid perf work; focus on semantics + observability.