Files
hakorune/docs/blueprints/strings-utf8-byte.md
Selfhosting Dev c8063c9e41 pyvm: split op handlers into ops_core/ops_box/ops_ctrl; add ops_flow + intrinsic; delegate vm.py without behavior change
net-plugin: modularize constants (consts.rs) and sockets (sockets.rs); remove legacy commented socket code; fix unused imports
mir: move instruction unit tests to tests/mir_instruction_unit.rs (file lean-up); no semantic changes
runner/pyvm: ensure using pre-strip; misc docs updates

Build: cargo build ok; legacy cfg warnings remain as before
2025-09-21 08:53:00 +09:00

2.7 KiB
Raw Blame History

Strings Blueprint — UTF8 First, Bytes Separate

Status: active (Phase Freeze compatible) Updated: 2025-09-21

Purpose

  • Unify string semantics by delegating StringBox public APIs to dedicated cursor boxes.
  • Keep behavior stable while making codepoint vs byte decisions explicit and testable.

Pillars

  • Utf8CursorBox (codepoint-oriented)
    • length/indexOf/substring operate on UTF8 codepoints.
    • Intended as the default delegate for StringBox public APIs.
  • ByteCursorBox (byte-oriented)
    • length/indexOf/substring operate on raw bytes.
    • Use explicitly for byte-level parsing or binary protocols.

Delegation Strategy

  • StringBox delegates to Utf8CursorBox for core methods: length/indexOf/substring.
  • Provide conversion helpers: toUtf8Cursor(), toByteCursor() (thin wrappers).
  • Prefer delegation over inheritance; keep “from” minimal to avoid API ambiguity.

API Semantics

  • indexOf: define two flavors via the box boundary.
    • StringBox.indexOf → Utf8CursorBox.indexOf (CP-based; canonical)
    • ByteCursorBox.indexOf → byte-based; optin only
  • substring: follow the same split (CP vs Byte). Do not mix semantics.
    • Document preconditions for indices (outofrange clamped/errored per guide).

Implementation Plan (staged, nonbreaking)

  1. Provide MVP cursor boxes (done)
    • apps/libs/utf8_cursor.nyash
    • apps/libs/byte_cursor.nyash
  2. Delegate StringBox public methods to Utf8CursorBox (internal only; behavior unchanged)
    • Start with length → indexOf → substring
    • Add targeted smokes for edge cases (multibyte CP, boundaries)
  3. Replace adhoc scans in Nyash scripts with cursor usage (MiniVM/macros)
    • Migrate internal scanners (no external behavior change)
  4. Introduce ByteCursorBox only where bytelevel semantics are required
    • Keep call sites explicit to avoid ambiguity

Transition Gate (Rust dev only)

  • Env NYASH_STR_CP=1 enables CP semantics for legacy byte-based paths in Rust runtime (e.g., StringBox.length/indexOf/lastIndexOf).
  • Default remains byte in Rust during freeze; PyVM follows CP semantics. CI smokes validate CP behavior via PyVM.

Related Docs

  • reference/language/strings.md — policy & scope
  • guides/language-core-and-sugar.md — core minimal + sugar
  • reference/language/EBNF.md — operators (! adopted; dowhile not adopted)
  • guides/loopform.md — loop normalization policy

Box Foundations (string-related)

  • Utf8CursorBox, ByteCursorBox
  • StringExtBox (trim/startsWith/endsWith/replace/split)
  • StringBuilderBox (append/toString)
  • JsonCursorBox (lightweight JSON scanning helpers)

Testing Notes

  • Keep PyVM as the reference execution path.
  • Add smokes: CP boundaries, mixed ASCII/nonASCII, indexOf not found, substring slices.
  • Avoid perf work; focus on semantics + observability.