diff --git a/proposals/stringref/Overview.md b/proposals/stringref/Overview.md
index 29edffa..c26efbe 100644
--- a/proposals/stringref/Overview.md
+++ b/proposals/stringref/Overview.md
@@ -25,7 +25,7 @@ find good compromises are "minimal" and "viable".
3. Allow WebAssembly implementations to efficiently represent strings
internally in either WTF-8 or WTF-16 encodings
4. Allow access to WTF-16 code units for Java, Dart, Kotlin and similar languages
- 5. Allow string literals in element sections
+ 5. Allow string literals as constant expressions
## Definitions
- *codepoint*: An integer in the range [0,0x10FFFF].
@@ -195,31 +195,39 @@ address ::= i32 | i64
Such instructions also take the memory to which to read or write as an
immediate.
+Although `stringref` is a nullable type, every instruction in this
+proposal traps if it receives a null `stringref` operand. The one
+exception is `string.eq`.
+
### Creating strings
```
-wtf8_policy ::= 'utf8' | 'wtf8' | 'replace'
-(string.new_wtf8 $memory $wtf8_policy ptr:address bytes:i32)
+(string.new_utf8 $memory ptr:address bytes:i32)
+ -> str:stringref
+(string.new_lossy_utf8 $memory ptr:address bytes:i32)
+ -> str:stringref
+(string.new_wtf8 $memory ptr:address bytes:i32)
-> str:stringref
```
-Create a new string from the *`bytes`* WTF-8 bytes in memory at *`ptr`*.
+Create a new string from the *`bytes`* bytes in memory at *`ptr`*.
Out-of-bounds access will trap. The maximum value for *`bytes`* is
2<sup>31</sup>–1; passing a higher value traps.
-The precise decoding semantics depend on the *`$wtf8_policy`* immediate.
+These three instructions decode the bytes in three different ways:
-For `utf8`, the bytes are decoded using a strict UTF-8 decoder. If the
-bytes are not valid UTF-8, trap.
+ * `string.new_utf8` decodes using a strict UTF-8 decoder. If the
+ bytes are not valid UTF-8, trap.
-For `wtf8`, the bytes are decoded using a strict WTF-8 decoder, which is
-like UTF-8 but also allows isolated surrogates. If the bytes are not
-valid WTF-8, trap.
+ * `string.new_lossy_utf8` decodes using a sloppy UTF-8 decoder: all
+ maximal subparts of an invalid subsequence are decoded as if they
+ were `U+FFFD` (the replacement character) instead. This instruction
+ will never trap due to a decoding error. See the section entitled
+ "U+FFFD Substitution of Maximal Subparts" in the Unicode standard,
+ version 14.0.0, page 126.
-For `replace`, all maximal subparts of an invalid subsequence are
-decoded as if they were `U+FFFD` (the replacement character) instead.
-The `replace` policy will never trap due to a decoding error. See the
-section entitled "U+FFFD Substitution of Maximal Subparts" in the
-Unicode standard, version 14.0.0, page 126.
+ * `string.new_wtf8` decodes using a strict WTF-8 decoder, which is like
+ UTF-8 but also allows isolated surrogates. If the bytes are not
+ valid WTF-8, trap.
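
The three decoding behaviours can be approximated outside WebAssembly. The sketch below is only a model, not part of the proposal: the function names are hypothetical, a Python exception stands in for a trap, and it assumes that CPython's `errors="replace"` handler matches the "maximal subparts" practice. Python's `surrogatepass` handler is also slightly more permissive than a strict WTF-8 decoder, because it accepts a surrogate pair encoded as two three-byte sequences.

```python
def new_utf8(data: bytes) -> str:
    """Model of string.new_utf8: strict UTF-8; an exception models a trap."""
    return data.decode("utf-8")


def new_lossy_utf8(data: bytes) -> str:
    """Model of string.new_lossy_utf8: invalid input becomes U+FFFD, never traps."""
    return data.decode("utf-8", errors="replace")


def new_wtf8(data: bytes) -> str:
    """Approximate model of string.new_wtf8: UTF-8 plus isolated surrogates.

    Assumption: the input does not contain an encoded surrogate *pair*,
    which strict WTF-8 rejects but Python's surrogatepass handler accepts.
    """
    return data.decode("utf-8", errors="surrogatepass")


if __name__ == "__main__":
    lone_surrogate = b"\xed\xa0\x80"  # WTF-8 encoding of U+D800
    print(repr(new_wtf8(lone_surrogate)))
    print(repr(new_lossy_utf8(b"a\xffb")))
    try:
        new_utf8(lone_surrogate)
    except UnicodeDecodeError:
        print("strict UTF-8 rejects an isolated surrogate (models a trap)")
```
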
```
(string.new_wtf16 $memory ptr:address codeunits:i32)
@@ -228,7 +236,9 @@ Unicode standard, version 14.0.0, page 126.
Create a new string from the *`codeunits`* code units encoded in memory at
*`ptr`*. Out-of-bounds access will trap. *`ptr`* must be two-byte
aligned, and will trap otherwise. The maximum value for *`codeunits`*
-is 2<sup>30</sup>–1; passing a higher value traps.
+is 2<sup>30</sup>–1; passing a higher value traps. Each code unit is
+read from memory as if with `i32.load16`, and is therefore decoded
+using little-endian byte order.
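
As an illustration of the little-endian rule, the following sketch reads 16-bit code units from a byte buffer standing in for linear memory. The helper name `read_code_units` and the use of Python exceptions to model traps are assumptions made for this example only.

```python
import struct
from typing import List


def read_code_units(memory: bytes, ptr: int, codeunits: int) -> List[int]:
    """Read `codeunits` little-endian 16-bit code units starting at `ptr`."""
    if ptr % 2 != 0:
        raise ValueError("ptr is not two-byte aligned (models a trap)")
    if ptr + 2 * codeunits > len(memory):
        raise IndexError("out-of-bounds access (models a trap)")
    # "<" selects little-endian byte order; "H" is an unsigned 16-bit integer.
    return list(struct.unpack_from(f"<{codeunits}H", memory, ptr))


if __name__ == "__main__":
    memory = b"\x41\x00\x3d\xd8"  # 'A', then the isolated surrogate 0xD83D
    print([hex(u) for u in read_code_units(memory, 0, 2)])
```
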
#### `string.new` size limits
@@ -252,7 +262,7 @@ be used in global variable initializers.
#### String literal section
The `string.const` section indicates the literal as an `i32` index into
-a new custom section: a string table, encoded as a `vec(vec(u8))` of
+a new regular section: a string table, encoded as a `vec(vec(u8))` of
valid WTF-8 strings. Because literal strings can contain codepoint 0,
strings in the string table do not use NUL as a terminator. The string
table section must immediately precede the global section, or where the
@@ -271,8 +281,8 @@ string literal section as a future extension.
The maximum size for the WTF-8 encoding of an individual string literal
is 2<sup>31</sup>–1 bytes. Embeddings may impose their own limits which
-are more restricted. But similarly to `string.new`, instantiating a
-module with string literals may fail due to lack of memory resources,
+are more restricted. But similarly to `string.new_wtf8`, instantiating
+a module with string literals may fail due to lack of memory resources,
even if the string size is formally within the limits. However
`string.const` itself never traps when passed a valid literal offset.
@@ -282,44 +292,96 @@ All parameters and return values measuring a number of codepoints or a
number of code units represent these sizes as unsigned values.
```
-(string.measure_wtf8 $wtf8_policy str:stringref)
- -> bytes:i32
+(string.measure_utf8 str:stringref)
+ -> codeunits:i32
+```
+Measure the number of code units (bytes) that would be required to
+encode the contents of the string *`str`* to UTF-8. If the string
+contains an isolated surrogate, return -1.
+
+The maximum number of code units returned by `string.measure_utf8` is
+2<sup>31</sup>-1. If an encoding would require more code units than the
+limit, the result is -1.
+
+```
+(string.measure_wtf8 str:stringref)
+ -> codeunits:i32
+```
+Measure the number of code units (bytes) that would be required to
+encode the codepoints of the string *`str`* to WTF-8.
+
+Note that this instruction also serves to measure an encoding length for
+UTF-8 when isolated surrogates are replaced with `U+FFFD` ("lossy
+UTF-8"); the same number of bytes is required to encode `U+FFFD` as
+would be required to encode an isolated surrogate to WTF-8.
+
+The maximum number of code units returned by `string.measure_wtf8` is
+2<sup>31</sup>-1. If an encoding would require more code units than the
+limit, the result is -1.
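
The observation that one measurement serves both WTF-8 and lossy UTF-8 can be checked with a small model. This is a sketch under assumptions: the function names are hypothetical, Python strings stand in for `stringref` contents, and the strings are assumed not to contain an adjacent high/low surrogate pair. The size limit is not modelled.

```python
def is_surrogate(ch: str) -> bool:
    return 0xD800 <= ord(ch) <= 0xDFFF


def measure_utf8(s: str) -> int:
    """Model of string.measure_utf8: -1 if an isolated surrogate is present."""
    if any(is_surrogate(ch) for ch in s):
        return -1
    return len(s.encode("utf-8"))


def measure_wtf8(s: str) -> int:
    """Model of string.measure_wtf8: surrogates count as three bytes each."""
    return len(s.encode("utf-8", errors="surrogatepass"))


def lossy_utf8_length(s: str) -> int:
    """Length after replacing each isolated surrogate with U+FFFD."""
    replaced = "".join("\ufffd" if is_surrogate(ch) else ch for ch in s)
    return len(replaced.encode("utf-8"))


if __name__ == "__main__":
    s = "a\ud800"
    print(measure_utf8(s), measure_wtf8(s), lossy_utf8_length(s))
```
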
+
+```
(string.measure_wtf16 str:stringref)
- -> bytes:i32
+ -> codeunits:i32
```
Measure the number of code units that would be required to encode the
-contents of the string *`str`* to WTF-8 or WTF-16 respectively.
-For `string.measure_wtf8` with the `utf8` policy, if the string contains
-an isolated surrogate, return -1.
+contents of the string *`str`* to WTF-16.
-The maximum number of code units returned by `string.measure_wtf8` is is
-2<sup>31</sup>-1. The maximum number of code units returned by
-`string.measure_wtf16` is is 2<sup>30</sup>-1. If an encoding would
-require more code units than the limit, the result is -1.
+The maximum number of code units returned by `string.measure_wtf16` is
+2<sup>30</sup>-1. If an encoding would require more code units than
+the limit, the result is -1.
```
-(string.encode_wtf8 $memory $wtf8_policy str:stringref ptr:address)
-(string.encode_wtf16 $memory str:stringref ptr:address)
+(string.encode_utf8 $memory str:stringref ptr:address)
+ -> codeunits:i32
```
-Encode the contents of the string *`str`* as UTF-8, WTF-8, or WTF-16,
-respectively, to memory at *`ptr`*. The number of code units written
-will be the same as returned by the corresponding
-`string.measure_*encoding*`.
+Encode the contents of the string *`str`* as UTF-8 to memory at *`ptr`*.
+If an isolated surrogate is seen, trap. Return the number of code units
+written, which will be the same as returned by the corresponding
+`string.measure_utf8`.
-Each code unit is written to memory as if stored by `i32.store8` or
-`i32.store16`, respectively, so WTF-16 code units are in little-endian
-byte order.
+The maximum number of bytes that can be encoded at once by
+`string.encode` is 2<sup>31</sup>-1. If an encoding would require more
+bytes, it is as if the codepoints can't be encoded (a trap).
+
+```
+(string.encode_lossy_utf8 $memory str:stringref ptr:address)
+ -> codeunits:i32
+```
+Encode the contents of the string *`str`* as UTF-8 to memory at *`ptr`*.
+If an isolated surrogate is seen, encode `U+FFFD` (the replacement
+character) instead. Return the number of code units written, which will
+be the same as returned by the corresponding `string.measure_wtf8`.
The maximum number of bytes that can be encoded at once by
`string.encode` is 2<sup>31</sup>-1. If an encoding would require more
bytes, it is as if the codepoints can't be encoded (a trap).
-For `string.encode_wtf8`, if an isolated surrogate is seen, the behavior
-determines on the *`$wtf8_policy`* immediate. For `utf8`, trap. For
-`wtf8`, the surrogate is encoded as per WTF-8. For `replace`, `U+FFFD`
-(the replacement character) is encoded instead. Note that the UTF-8
-encoding of `U+FFFD` is the same length as the WTF-8 encoding of an
-isolated surrogate.
+```
+(string.encode_wtf8 $memory str:stringref ptr:address)
+ -> codeunits:i32
+```
+Encode the contents of the string *`str`* as WTF-8 to memory at *`ptr`*.
+Return the number of code units written, which will be the same as
+returned by the corresponding `string.measure_wtf8`.
+
+The maximum number of bytes that can be encoded at once by
+`string.encode` is 2<sup>31</sup>-1. If an encoding would require more
+bytes, it is as if the codepoints can't be encoded (a trap).
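
To make the difference between the three byte-oriented encoders concrete, here is a hedged model of what each would write for the same input. The function names are hypothetical, a Python exception models a trap, and the returned `bytes` object stands in for the bytes written to memory (its length being the instruction's result).

```python
def encode_utf8(s: str) -> bytes:
    """Model of string.encode_utf8: raises (traps) on an isolated surrogate."""
    return s.encode("utf-8")


def encode_lossy_utf8(s: str) -> bytes:
    """Model of string.encode_lossy_utf8: surrogates become U+FFFD."""
    replaced = "".join(
        "\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in s
    )
    return replaced.encode("utf-8")


def encode_wtf8(s: str) -> bytes:
    """Model of string.encode_wtf8: surrogates are encoded as three bytes.

    Assumption: `s` contains no adjacent high/low surrogate pair.
    """
    return s.encode("utf-8", errors="surrogatepass")


if __name__ == "__main__":
    s = "a\ud800"
    print(encode_lossy_utf8(s).hex(" "))
    print(encode_wtf8(s).hex(" "))
    try:
        encode_utf8(s)
    except UnicodeEncodeError:
        print("encode_utf8 traps on the isolated surrogate")
```

Both non-trapping encoders write the same number of bytes for this input, which is why `string.measure_wtf8` can size the buffer for either.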
+
+```
+(string.encode_wtf16 $memory str:stringref ptr:address)
+ -> codeunits:i32
+```
+Encode the contents of the string *`str`* as WTF-16 to memory at
+*`ptr`*. Return the number of code units written, which will be the
+same as returned by the corresponding `string.measure_wtf16`.
+
+Each code unit is written to memory as if stored by `i32.store16`, so
+WTF-16 code units are in little-endian byte order.
+
+The maximum number of bytes that can be encoded at once by
+`string.encode` is 2<sup>31</sup>-1. If an encoding would require more
+bytes, it is as if the codepoints can't be encoded (a trap).
### Concatenation
@@ -344,8 +406,9 @@ If an allocation fails, the implementation must trap. Fallible
```
(string.eq a:stringref b:stringref) -> i32
```
-Return 1 if the strings *`a`* and *`b`* contain the same codepoint
-sequence. Return 0 otherwise.
+If both *`a`* and *`b`* are null, return 1. If only one of them is
+null, return 0. Otherwise return 1 if the strings *`a`* and *`b`*
+contain the same codepoint sequence, or 0 otherwise.
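
The null handling of `string.eq` can be summarised with a short model. This is a sketch only: `None` stands in for a null `stringref`, and the function name is hypothetical.

```python
from typing import Optional


def string_eq(a: Optional[str], b: Optional[str]) -> int:
    """Model of string.eq; None stands in for a null stringref."""
    if a is None and b is None:
        return 1  # two nulls compare equal
    if a is None or b is None:
        return 0  # null never equals a non-null string
    # Python compares str values codepoint by codepoint.
    return 1 if a == b else 0


if __name__ == "__main__":
    print(string_eq(None, None), string_eq(None, ""), string_eq("ab", "ab"))
```
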
```
(string.is_usv_sequence str:stringref)
@@ -392,7 +455,11 @@ may allow for 64-bit variants of the position-using instructions, which
could relax this restriction.)
```
-(stringview_wtf8.encode $memory $wtf8_policy view:stringview_wtf8 ptr:address pos:i32 bytes:i32)
+(stringview_wtf8.encode_utf8 $memory view:stringview_wtf8 ptr:address pos:i32 bytes:i32)
+ -> next_pos:i32, bytes:i32
+(stringview_wtf8.encode_lossy_utf8 $memory view:stringview_wtf8 ptr:address pos:i32 bytes:i32)
+ -> next_pos:i32, bytes:i32
+(stringview_wtf8.encode_wtf8 $memory view:stringview_wtf8 ptr:address pos:i32 bytes:i32)
-> next_pos:i32, bytes:i32
```
Write a subsequence of the WTF-8 encoding of *`view`* to memory at
@@ -410,16 +477,20 @@ proposal](https://github.com/WebAssembly/memory64/blob/main/proposals/memory64/O
may allow for 64-bit variants of the position-using instructions, which
could relax this restriction.)
-If an isolated surrogate is seen, the behavior determines on the
-*`$wtf8_policy`* immediate, as in `string.encode_wtf8`.
+If an isolated surrogate is seen, the behavior depends on the
+instruction:
+ * `stringview_wtf8.encode_utf8` will trap.
+ * `stringview_wtf8.encode_lossy_utf8` will encode `U+FFFD`.
+ * `stringview_wtf8.encode_wtf8` will encode the isolated surrogate.
```
(stringview_wtf8.slice view:stringview_wtf8 start:i32 end:i32)
-> str:stringref
```
Return a substring of *`view`*, for the WTF-8 bytes starting at offset
-*`start`* and not greater than *`end`*. *`start`* and *`end`* receive
-the "WTF-8 position treatment", as for `stringview_wtf8.advance`.
+*`start`* and continuing to but not including *`end`*. *`start`* and
+*`end`* receive the "WTF-8 position treatment", as for
+`stringview_wtf8.advance`.
### `stringview_wtf16`
@@ -448,10 +519,12 @@ Return the 16-bit code unit at offset *`pos`* in the WTF-16 encoding of
```
(stringview_wtf16.encode $memory view:stringview_wtf16 ptr:address pos:i32 len:i32)
+ -> codeunits:i32
```
Write a subsequence of the WTF-16 encoding of *`view`* to memory at
*`ptr`*, starting at the WTF-16 offset *`pos`*, writing no more than
*`len`* 16-bit code units. If *`ptr`* is not two-byte aligned, trap.
+Return the number of code units written.
If *`pos`* is greater than the number of WTF-16 code units in *`view`*,
it is as if it were instead given as the code unit length. This
@@ -462,8 +535,9 @@ transformation is the "WTF-16 position treatment".
-> str:stringref
```
Return a substring of *`view`*, for the WTF-16 code units starting at offset
-*`start`* and not greater than *`end`*. *`start`* and *`end`* receive
-the "WTF-16 position treatment", as for `stringview_wtf16.encode`.
+*`start`* and continuing to but not including *`end`*. *`start`* and
+*`end`* receive the "WTF-16 position treatment", as for
+`stringview_wtf16.encode`.
### `stringview_iter`
@@ -477,11 +551,12 @@ stringview can then be used to iterate over the codepoints of the
string.
```
-(stringview_iter.cur view:stringview_iter)
+(stringview_iter.next view:stringview_iter)
-> codepoint:i32
```
-Return the codepoint currently pointed to by *`view`*, or -1 if the
-iterator is at the end of the string.
+If *`view`* is already at the end of the string, return -1. Otherwise
+return the codepoint currently pointed to by the iterator, and advance
+the iterator's position by one codepoint.
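
The combined read-and-advance behaviour of `stringview_iter.next` can be modelled as follows. The class below is a hypothetical illustration in which a Python string stands in for the underlying `stringref`.

```python
class StringViewIter:
    """Minimal model of a stringview_iter over a string's codepoints."""

    def __init__(self, s: str) -> None:
        self._codepoints = [ord(ch) for ch in s]
        self._pos = 0

    def next(self) -> int:
        """Return the current codepoint and advance, or -1 at the end."""
        if self._pos >= len(self._codepoints):
            return -1  # position is unchanged once the end is reached
        codepoint = self._codepoints[self._pos]
        self._pos += 1
        return codepoint


if __name__ == "__main__":
    it = StringViewIter("h\u00e9")
    while (cp := it.next()) != -1:
        print(hex(cp))
```
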
```
(stringview_iter.advance view:stringview_iter codepoints:i32)
@@ -504,6 +579,86 @@ Return the number of codepoints that were actually consumed.
Return a substring of *`view`*, starting at the current position of
*`view`* and continuing for at most *`codepoints`* codepoints.
+### GC integration
+
+Though this proposal does not have a dependency on the [GC
+proposal](https://github.com/WebAssembly/gc/blob/master/proposals/gc/MVP.md),
+compiler authors that target GC will likely want to be able to encode
+the contents of a stringref to a GC array, and vice versa.
+
+The primary use cases are:
+
+ 1. String-builder interfaces, which will likely use a WTF-8 or WTF-16
+ array as intermediate storage, depending on the language being
+ compiled. We will need to be able to create strings from arrays.
+ When the string contents are ready, we will almost always decode
+ from array offset 0 and continue to some offset before the end of
+ the array. We'll also need to be able to append a string's contents
+ to an array at a given offset.
+ 2. Communicating strings with another process, possibly over the
+ network. Here, UTF-8 and WTF-8 are the important encodings, and we
+ need to be able to read and write to arbitrary slices of arrays.
+
+The instructions below shall be available in WebAssembly implementations
+that support both GC and stringrefs.
+
+```
+(string.new_utf8_array codeunits:$t start:i32 end:i32)
+ if expand($t) => array i8
+ -> str:stringref
+(string.new_lossy_utf8_array codeunits:$t start:i32 end:i32)
+ if expand($t) => array i8
+ -> str:stringref
+(string.new_wtf8_array codeunits:$t start:i32 end:i32)
+ if expand($t) => array i8
+ -> str:stringref
+```
+Create a new string from a subsequence of the *`codeunits`* bytes in a
+GC-managed array, starting from offset *`start`* and continuing to but
+not including *`end`*. If *`end`* is less than *`start`* or is greater
+than the array length, trap. The bytes are decoded in the same way as
+`string.new_utf8`, `string.new_lossy_utf8`, and `string.new_wtf8`,
+respectively. The maximum value for *`end`*–*`start`* is
+2<sup>31</sup>–1; passing a higher value traps.
+
+```
+(string.new_wtf16_array codeunits:$t start:i32 end:i32)
+ if expand($t) => array i16
+ -> str:stringref
+```
+Create a new string from a subsequence of the *`codeunits`* WTF-16 code
+units in a GC-managed array, starting from offset *`start`* and
+continuing to but not including *`end`*. If *`end`* is less than
+*`start`* or is greater than the array length, trap. The maximum value
+for *`end`*–*`start`* is 2<sup>30</sup>–1; passing a higher value
+traps.
+
+```
+(string.encode_utf8_array str:stringref array:$t start:i32)
+ if expand($t) => array (mut i8)
+ -> codeunits:i32
+(string.encode_lossy_utf8_array str:stringref array:$t start:i32)
+ if expand($t) => array (mut i8)
+ -> codeunits:i32
+(string.encode_wtf8_array str:stringref array:$t start:i32)
+ if expand($t) => array (mut i8)
+ -> codeunits:i32
+(string.encode_wtf16_array str:stringref array:$t start:i32)
+ if expand($t) => array (mut i16)
+ -> codeunits:i32
+```
+Encode the contents of the string *`str`* as WTF-8 or WTF-16,
+respectively, to the GC-managed array *`array`*, starting at offset
+*`start`*. Return the number of code units written, which will be the
+same as the result of the corresponding `string.measure_wtf8` or
+`string.measure_wtf16`, respectively. If there is not space for the
+code units in the array, trap. Note that no `NUL` terminator is ever
+written.
+
+For `string.encode_utf8_array`, trap if an isolated surrogate is seen.
+For `string.encode_lossy_utf8_array`, replace isolated surrogates with
+`U+FFFD`.
+
## Binary encoding
```
@@ -513,35 +668,46 @@ reftype ::= ...
| 0x62 ⇒ stringview_wtf16 ; SLEB128(-0x1e)
| 0x61 ⇒ stringview_iter ; SLEB128(-0x1f)
-wtf8_policy ::= 0x00 ⇒ utf8
- | 0x01 ⇒ wtf8
- | 0x02 ⇒ replace
-
instr ::= ...
- | 0xfb 0x80 $mem:u32 $policy:u32 ⇒ string.new_wtf8 $mem $policy
- | 0xfb 0x81 $mem:u32 ⇒ string.new_wtf16 $mem
- | 0xfb 0x82 $idx:u32 ⇒ string.const $idx
- | 0xfb 0x84 $policy:u32 ⇒ string.measure_wtf8 $policy
- | 0xfb 0x85 ⇒ string.measure_wtf16
- | 0xfb 0x86 $mem:u32 $policy:u32 ⇒ string.encode_wtf8 $mem $policy
- | 0xfb 0x87 $mem:u32 ⇒ string.encode_wtf16 $mem
- | 0xfb 0x88 ⇒ string.concat
- | 0xfb 0x89 ⇒ string.eq
- | 0xfb 0x8a ⇒ string.is_usv_sequence
- | 0xfb 0x90 ⇒ string.as_wtf8
- | 0xfb 0x91 ⇒ stringview_wtf8.advance
- | 0xfb 0x92 $mem:u32 $policy:u32 ⇒ stringview_wtf8.encode $mem $policy
- | 0xfb 0x93 ⇒ stringview_wtf8.slice
- | 0xfb 0x98 ⇒ string.as_wtf16
- | 0xfb 0x99 ⇒ stringview_wtf16.length
- | 0xfb 0x9a ⇒ stringview_wtf16.get_codeunit
- | 0xfb 0x9b $mem:u32 ⇒ stringview_wtf16.encode $mem
- | 0xfb 0x9c ⇒ stringview_wtf16.slice
- | 0xfb 0xa0 ⇒ string.as_iter
- | 0xfb 0xa1 ⇒ stringview_iter.cur
- | 0xfb 0xa2 ⇒ stringview_iter.advance
- | 0xfb 0xa3 ⇒ stringview_iter.rewind
- | 0xfb 0xa4 ⇒ stringview_iter.slice
+ | 0xfb 0x80:u32 $mem:u32 ⇒ string.new_utf8 $mem
+ | 0xfb 0x81:u32 $mem:u32 ⇒ string.new_wtf16 $mem
+ | 0xfb 0x82:u32 $idx:u32 ⇒ string.const $idx
+ | 0xfb 0x83:u32 ⇒ string.measure_utf8
+ | 0xfb 0x84:u32 ⇒ string.measure_wtf8
+ | 0xfb 0x85:u32 ⇒ string.measure_wtf16
+ | 0xfb 0x86:u32 $mem:u32 ⇒ string.encode_utf8 $mem
+ | 0xfb 0x87:u32 $mem:u32 ⇒ string.encode_wtf16 $mem
+ | 0xfb 0x88:u32 ⇒ string.concat
+ | 0xfb 0x89:u32 ⇒ string.eq
+ | 0xfb 0x8a:u32 ⇒ string.is_usv_sequence
+ | 0xfb 0x8b:u32 $mem:u32 ⇒ string.new_lossy_utf8 $mem
+ | 0xfb 0x8c:u32 $mem:u32 ⇒ string.new_wtf8 $mem
+ | 0xfb 0x8d:u32 $mem:u32 ⇒ string.encode_lossy_utf8 $mem
+ | 0xfb 0x8e:u32 $mem:u32 ⇒ string.encode_wtf8 $mem
+ | 0xfb 0x90:u32 ⇒ string.as_wtf8
+ | 0xfb 0x91:u32 ⇒ stringview_wtf8.advance
+ | 0xfb 0x92:u32 $mem:u32 ⇒ stringview_wtf8.encode_utf8 $mem
+ | 0xfb 0x93:u32 ⇒ stringview_wtf8.slice
+ | 0xfb 0x94:u32 $mem:u32 ⇒ stringview_wtf8.encode_lossy_utf8 $mem
+ | 0xfb 0x95:u32 $mem:u32 ⇒ stringview_wtf8.encode_wtf8 $mem
+ | 0xfb 0x98:u32 ⇒ string.as_wtf16
+ | 0xfb 0x99:u32 ⇒ stringview_wtf16.length
+ | 0xfb 0x9a:u32 ⇒ stringview_wtf16.get_codeunit
+ | 0xfb 0x9b:u32 $mem:u32 ⇒ stringview_wtf16.encode $mem
+ | 0xfb 0x9c:u32 ⇒ stringview_wtf16.slice
+ | 0xfb 0xa0:u32 ⇒ string.as_iter
+ | 0xfb 0xa1:u32 ⇒ stringview_iter.next
+ | 0xfb 0xa2:u32 ⇒ stringview_iter.advance
+ | 0xfb 0xa3:u32 ⇒ stringview_iter.rewind
+ | 0xfb 0xa4:u32 ⇒ stringview_iter.slice
+ | 0xfb 0xb0:u32 [gc] ⇒ string.new_utf8_array
+ | 0xfb 0xb1:u32 [gc] ⇒ string.new_wtf16_array
+ | 0xfb 0xb2:u32 [gc] ⇒ string.encode_utf8_array
+ | 0xfb 0xb3:u32 [gc] ⇒ string.encode_wtf16_array
+ | 0xfb 0xb4:u32 [gc] ⇒ string.new_lossy_utf8_array
+ | 0xfb 0xb5:u32 [gc] ⇒ string.new_wtf8_array
+ | 0xfb 0xb6:u32 [gc] ⇒ string.encode_lossy_utf8_array
+ | 0xfb 0xb7:u32 [gc] ⇒ string.encode_wtf8_array
;; New section. If present, must be present only once, and right before
;; the globals section (or where the globals section would be). Each
@@ -552,10 +718,13 @@ instr ::= ...
stringrefs ::= section_14(0x00 vec(vec(u8)))
```
+Note that the u32 (uleb) encoding for the opcode after the `0xfb` prefix
+takes two bytes, for opcode values between 0x80 and 0x3fff.
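
To see why these opcodes occupy two bytes, the sketch below implements unsigned LEB128 and assembles one instruction according to the grammar above. The helper names `uleb128` and `encode_instr` are hypothetical and exist only for this illustration.

```python
def uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("uleb128 requires a non-negative integer")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more groups follow: set continuation bit
        else:
            out.append(byte)
            return bytes(out)


def encode_instr(opcode: int, *immediates: int) -> bytes:
    """0xfb prefix, then the opcode and each immediate as u32 LEB128."""
    return b"\xfb" + uleb128(opcode) + b"".join(uleb128(i) for i in immediates)


if __name__ == "__main__":
    # string.new_utf8 on memory 0, i.e. opcode 0x80 with one immediate.
    print(encode_instr(0x80, 0).hex(" "))
    print(len(uleb128(0x7F)), len(uleb128(0x80)), len(uleb128(0x3FFF)), len(uleb128(0x4000)))
```
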
+
## Examples
-We assume that the textual syntax for `string.encode` and `string.new`
-allows you to elide the memory, in which case it defaults to 0.
+We assume that the textual syntax for instructions that take a memory
+operand allows you to elide the memory, in which case it defaults to 0.
### Make string from NUL-terminated UTF-8 in memory
@@ -564,13 +733,12 @@ allows you to elide the memory, in which case it defaults to 0.
local.get $ptr
local.get $ptr
call $strlen
- string.new_wtf8)
+ string.new_utf8)
```
-Generally speaking, this proposal only distinguishes between UTF-8 and
-WTF-8 when encoding string contents to memory. As this is a a decode
-operation, the proposal just has a WTF-8 interface, as WTF-8 is a
-superset of UTF-8.
+If the bytes being decoded aren't actually valid UTF-8, this function
+will trap. Use `string.new_lossy_utf8` in contexts where replacing
+invalid data with `U+FFFD` is a better strategy than trapping.
### Make string from an array of WTF-8 code units in memory
@@ -581,6 +749,10 @@ superset of UTF-8.
string.new_wtf8)
```
+Note that `string.new_wtf8` (and `string.new_wtf8_array`) are always
+strict decoders: if the bytes are not valid WTF-8, the instruction
+traps.
+
### Make string from UTF-16 in memory
```wasm
@@ -638,7 +810,9 @@ rather it just deals in WTF-16, as most source languages that expose
local.get $str
string.as_wtf16
local.get $offset
+ local.get $offset
local.get $codeunits
+ i32.add
stringview_wtf16.slice)
```
@@ -778,7 +952,7 @@ open to considering adding more instructions.
(local $len i32)
(local $ptr i32)
local.get $str
- string.measure_wtf8 utf8
+ string.measure_utf8
local.set $len
block $valid
@@ -797,10 +971,9 @@ open to considering adding more instructions.
local.get $str
local.get $ptr
- string.encode_wtf8 wtf8
+ string.encode_utf8 ;; push bytes written, same as $len
local.get $ptr
- local.get $len
i32.add
i32.const 0
i32.store8 ;; write NUL
@@ -809,12 +982,17 @@ open to considering adding more instructions.
return)
```
-Using `string.measure_wtf8 utf8` ensures that the encoded string is a
-valid unicode scalar value sequence. How to handle invalid UTF-8 is up
-to the user; instead of `unreachable` we could throw an exception.
+Using `string.measure_utf8` ensures that the encoded string is a valid
+Unicode scalar value sequence. How to handle invalid UTF-8 is up to the
+user; instead of `unreachable` we could throw an exception.
+
+Note that in this case, the subsequent `string.encode_utf8` could just
+as well have been `string.encode_lossy_utf8` or `string.encode_wtf8`, as
+these instructions are all the same for strings that do not contain
+isolated surrogates, and we checked that there were none.
If we meant to handle isolated surrogates, we could use
-`string.measure_wtf8 wtf8` instead.
+`string.measure_wtf8` instead.
### Stream over contents of string
@@ -834,7 +1012,7 @@ will encode isolated surrogates as WTF-8.
local.get $cursor
global.get $buf
i32.const 1024
- string.encode_wtf8 wtf8 ;; push bytes written
+ string.encode_wtf8 ;; push bytes written
local.tee $bytes
(if i32.eqz (then return)) ;; if no bytes encoded, done
local.get $bytes
@@ -908,7 +1086,7 @@ WTF-8 in memory, for longer strings.
block $done
loop $loop
local.get $iter
- stringview_iter.cur
+ stringview_iter.next
local.tee $ch
i32.const -1
@@ -917,11 +1095,6 @@ WTF-8 in memory, for longer strings.
local.get $ch
call $have-codepoint
-
- local.get $iter
- i32.const 1
- string.advance_wtf8
- drop
end
end)
```