buffer: implement WHATWG Encoding Standard API#13644
buffer: implement WHATWG Encoding Standard API#13644jasnell wants to merge 1 commit intonodejs:masterfrom
Conversation
|
What's special about iso-8859-16? |
|
It's not implemented by ICU at all, even with full-icu |
TimothyGu
left a comment
There was a problem hiding this comment.
Some preliminary comments. I'll need some time to digest the intricacies of BOM and the various flags related to it.
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
This function doesn't have to handle defaults -- nor should it. The two places that use this function both ensure encoding !== undefined. I also somewhat prefer the signature function getEncodingFromLabel(label) to align better with the spec.
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
ASCII whitespace (and ASCII whitespace only, not Unicode) needs to be trimmed as well per step 1 of the spec.
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
Just return enc w/o if should be fine and simpler.
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
How about just !!U_SUCCESS(status)? The == 1 looks kinda fragile.
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
Object.defineProperties is pretty slow to be used on per object construction. I don't see any harm in making kHandle configurable, enumerable, or writable, as it's a symbol property already (and it's what we do for URL). encoding, fatal, and ignoreBOM should all be defined on the prototype as getter functions per spec.
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
Use the non-deprecated Uint32Value(Local<Context>) signature of the function.
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
Independent of this PR, we should find a way to cache this ObjectTemplate.
Potentially dumb question: why not use ObjectWrap?
There was a problem hiding this comment.
For no particular reason other than ObjectWrap really isn't used anywhere else in core. I've actually been contemplating whether or not it is something we could eliminate at some point.
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
SetLength, too, takes number of entries as argument, not number of bytes.
tools/icu/icu-generic.gyp
Outdated
There was a problem hiding this comment.
Does the comment still apply?
There was a problem hiding this comment.
Nope. Was planning to remove the whole section.
doc/api/buffer.md
Outdated
There was a problem hiding this comment.
Is it documented anywhere how to get full ICU?
There was a problem hiding this comment.
https://github.com/nodejs/node/blob/master/BUILDING.md#intl-ecma-402-support
+
https://github.com/nodejs/node/wiki/Intl (there is a link to it in the first fragment).
There was a problem hiding this comment.
This should be better documented.
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
Per spec, the returned Uint8Array must be unpooled and have an entire block of ArrayBuffer dedicated to it.
doc/api/buffer.md
Outdated
There was a problem hiding this comment.
[WHATWG Encoding Standard][] (+ second pair of brackets) for consistency?
doc/api/buffer.md
Outdated
There was a problem hiding this comment.
while( -> while ( ?
doc/api/buffer.md
Outdated
There was a problem hiding this comment.
stream:true -> stream: true ?
doc/api/buffer.md
Outdated
There was a problem hiding this comment.
Note about defaults (false = strip the BOM)?
There was a problem hiding this comment.
(
false= strip the BOM)
That's correct.
var dec = new TextDecoder('utf-16be', { ignoreBOM: true });
[...dec.decode(new Uint8Array([0xfe, 0xff, 0x00, 0x20]))].map(str => str.codePointAt(0).toString(16));
// ["feff", "20"]var dec = new TextDecoder('utf-16be', { ignoreBOM: false });
[...dec.decode(new Uint8Array([0xfe, 0xff, 0x00, 0x20]))].map(str => str.codePointAt(0).toString(16));
// ["20"]There was a problem hiding this comment.
I mean maybe it is worth to add a note about the default value.
There was a problem hiding this comment.
Why we need to require Buffer if we state in the doc:
The
Bufferclass is a global within Node.js, making it unlikely that one would need to ever userequire('buffer').Buffer
There was a problem hiding this comment.
There was a problem hiding this comment.
Why do we use it only in the lib/.eslintrc.yaml and not in the test/.eslintrc.yaml?
There was a problem hiding this comment.
I'm guessing here: tests are supposed to be self-contained scripts, while everything in lib is user-visible and -contaminatable, like a seemingly innocent var Buffer = {} line in REPL. See nodejs/node-convergence-archive#21.
|
Updated |
doc/api/buffer.md
Outdated
There was a problem hiding this comment.
If https://help.github.com/articles/organizing-information-with-tables/ works as advertised, github markdown tables would be easier to read in-source and to review.
There was a problem hiding this comment.
As far as I understand (and I do not see anything in the referenced link) .. markdown tables do not support rowspan.
doc/api/buffer.md
Outdated
There was a problem hiding this comment.
I wonder if withBOM (EDIT: or leaveBOM) would make more sense for this, if its not a fixed name due to the spec? I thought the docs might have had a typo and reversed the sense, because if the BOM is ignored it wouldn't be in the output, then I thought about it, and realized the sense of ignore is "don't remove from input stream if present". I'm not sure what would happen if the BOM is not in the input stream, though, would it get added to the decoded result?
There was a problem hiding this comment.
ignoreBOM is fixed within the spec. To change this would require a change in the spec.
There was a problem hiding this comment.
Understood. I know this is all experimental, but FYI, its not clear to me from current docs whether a BOM is added if not present, or just passed through ("ignored") if present.
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
The Web IDL spec treats null the same way as undefined. See https://heycam.github.io/webidl/#es-dictionary, step 4.1.2.
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
IDL recognizes all truthy values as true, so just options.stream ?.
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
This should coerce encoding to string rather than check its type, per IDL rules. See here for how it's done in URL.
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
Here too: null should be treated just like undefined. Also, options cannot be undefined here due to the default parameter above.
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
SharedArrayBuffer is not supported at this moment in Web platform APIs that do not have [AllowShared] extended attribute. See https://heycam.github.io/webidl/#es-buffer-source-types.
There was a problem hiding this comment.
Yep, understood. Unfortunately, the separate process.binding('util').isArrayBuffer() and process.binding('util').isSharedArrayBuffer() method no longer exist. It looks like those were condensed recently into a single IsAnyArrayBuffer(). I think this is one we can safely ignore for now.
There was a problem hiding this comment.
Seems bad to ignore, as becoming spec-complaint later will be a backward-incompatible breaking change.
There was a problem hiding this comment.
Yeah, agreed. I've added the isArrayBuffer() util method back in to process.binding('util')
src/node_i18n.cc
Outdated
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
This should use the Local<Context> version as well.
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
Put && !converter->ignoreBOM_ here as well?
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
This flag should only be set when there actually is a BOM, i.e. bomOffset != 0.
There was a problem hiding this comment.
Not quite ...
In: https://encoding.spec.whatwg.org/#interface-textdecoder
* If encoding is UTF-8, UTF-16BE, or UTF-16LE, and ignore BOM flag and BOM seen flag are unset, then:
* If token is U+FEFF, then set BOM seen flag.
* Otherwise, if token is not end-of-stream, then set BOM seen flag and append token to output.
* Otherwise, return output.
src/node_i18n.cc
Outdated
There was a problem hiding this comment.
The ICU docs say:
For most Unicode charsets it is also possible to ignore the indicated number of initial stream bytes and start converting after them. However, there are stateful Unicode charsets (UTF-7 and BOCU-1) for which this will not work. Therefore, it is best to ignore the first output UChar instead of the input signature bytes.
AFAICT source is incremented by the signature bytes, so we are doing what the ICU docs advise not to do. Are the negative consequences applicable to us?
There was a problem hiding this comment.
I do not believe so. TextDecoder does not support UTF-7 nor BOCU-1. We are only skipping the signature bytes for UTF-8, UTF-16le, and UTF-16be, none of which have this issue.
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
input is an optional parameter too. See https://encoding.spec.whatwg.org/#interface-textencoder
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
You have to use template literal syntax to prevent symbols from getting converted to strings. See
Line 56 in 448c4c6
lib/internal/encoding.js
Outdated
1e8402f to
97f8476
Compare
|
Updated :-) |
TimothyGu
left a comment
There was a problem hiding this comment.
Looking much better, thanks a lot!
doc/api/buffer.md
Outdated
There was a problem hiding this comment.
Also DataView. The %TypedArray% notation is generally reserved to specs, so do you think it is appropriate to use it here?
doc/api/buffer.md
Outdated
doc/api/buffer.md
Outdated
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
Template string literal is needed here as well (to disallow symbols).
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
This check should technically be used for the user-visible getters and the methods as well.
There was a problem hiding this comment.
Yeah, for now I'd rather leave it just here tho as it's not a check we commonly do.
There was a problem hiding this comment.
FYI, we are doing this for URLs (see Web IDL create a operation function step 2.1.2.4). I'll defer to your opinion regarding these classes though.
There was a problem hiding this comment.
I would recommend adding appropriate type-checking here if possible; not a big deal, but seems unnecessary to diverge from the spec and from URL in that way.
There was a problem hiding this comment.
@domenic ... just want to make sure I'm clear on what kind of checking you're recommending? instanceof or is the more efficient hidden-symbol check ok?
There was a problem hiding this comment.
Hidden symbol check for sure. Although ideally one that distinguishes encoder vs. decoder.
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
Same here. encoding and encode should have this check too.
There was a problem hiding this comment.
For the time being, I'd prefer not.
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
:/ In this case I'd rather add a dummy property through a constructor, just so that subclasses that extend TextEncoder but override Symbol.toStringTag can work.
src/node_i18n.cc
Outdated
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
Would a regex like /[\t\n\f\r ]/ be faster?
|
A bug I noticed: https://encoding.spec.whatwg.org/#dom-textdecoder-decode step 1 unsets the BOM seen flag if do not flush flag resulted from a previous run is not set (i.e. this is either the first call to > var dec = new buffer.TextDecoder('utf-8');
> [...dec.decode(Uint8Array.of(0xEF, 0xBB, 0xBF, 0x20))].map(str => str.codePointAt(0).toString(16));
[ '20' ]
> [...dec.decode(Uint8Array.of(0xEF, 0xBB, 0xBF, 0x20))].map(str => str.codePointAt(0).toString(16));
[ 'feff', '20' ]The correct behavior seems to also be the one implemented by Chrome: > var dec = new TextDecoder('utf-8');
> [...dec.decode(Uint8Array.of(0xEF, 0xBB, 0xBF, 0x20))].map(str => str.codePointAt(0).toString(16));
[ '20' ]
> [...dec.decode(Uint8Array.of(0xEF, 0xBB, 0xBF, 0x20))].map(str => str.codePointAt(0).toString(16));
[ '20' ] |
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
Rather than mentioning these as "unsupported", we can in fact say they don't necessarily need to be supported, since TextDecoder errors out on 'replacement'.
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
If we make sure no entries in encodings Map actually map to 'replacement' as value, we don't have to explicitly check for it here.
|
@TimothyGu ... updated! |
There was a problem hiding this comment.
LGTM as is, but some further observations are below. WPT integration can be worked on after this PR is merged.
I do have one more suggested edit: TimothyGu@0d0f7fc, which eliminates the need to copy from a Buffer to an Uint8Array.
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
FYI, we are doing this for URLs (see Web IDL create a operation function step 2.1.2.4). I'll defer to your opinion regarding these classes though.
lib/internal/encoding.js
Outdated
There was a problem hiding this comment.
This unfortunately will not distinguish TextEncoder and TextDecoder objects, as both have this property defined.
Provide an (initially experimental) implementation of the WHATWG Encoding Standard API (`TextDecoder` and `TextEncoder`). The is the same API implemented on the browser side. By default, with small-icu, only the UTF-8, UTF-16le and UTF-16be decoders are supported. With full-icu enabled, every encoding other than iso-8859-16 is supported. This provides a basic test, but does not include the full web platform tests. Note: many of the web platform tests for this would fail by default because we ship with small-icu by default. A process warning will be emitted on first use to indicate that the API is still experimental. No runtime flag is required to use the feature. Refs: https://encoding.spec.whatwg.org/ PR-URL: nodejs#13644 Reviewed-By: Timothy Gu <timothygu99@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Provide an (initially experimental) implementation of the WHATWG Encoding Standard API (`TextDecoder` and `TextEncoder`). The is the same API implemented on the browser side. By default, with small-icu, only the UTF-8, UTF-16le and UTF-16be decoders are supported. With full-icu enabled, every encoding other than iso-8859-16 is supported. This provides a basic test, but does not include the full web platform tests. Note: many of the web platform tests for this would fail by default because we ship with small-icu by default. A process warning will be emitted on first use to indicate that the API is still experimental. No runtime flag is required to use the feature. Backport-PR-URL: #14585 Backport-Reviewed-By: Anna Henningsen <anna@addaleax.net> Refs: https://encoding.spec.whatwg.org/ PR-URL: #13644 Reviewed-By: Timothy Gu <timothygu99@gmail.com> Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
V8 6.0: The V8 engine has been upgraded to version 6.0, which has a significantly changed performance profile. [#14574](#14574) More detailed information on performance differences can be found at https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de Other notable changes: * **DNS** * Independent DNS resolver instances are supported now, with support for cancelling the corresponding requests. [#14518](#14518) * **REPL** * Autocompletion support for `require()` has been improved. [#14409](#14409) * **Utilities** * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has been implemented. [#13644](#13644) * **Added new collaborators** * [XadillaX](https://github.com/XadillaX) – Khaidi Chu
V8 6.0: The V8 engine has been upgraded to version 6.0, which has a significantly changed performance profile. [#14574](#14574) More detailed information on performance differences can be found at https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de Other notable changes: * **DNS** * Independent DNS resolver instances are supported now, with support for cancelling the corresponding requests. [#14518](#14518) * **N-API** * Multiple N-API functions for error handling have been changed to support assigning error codes. [#13988](#13988) * **REPL** * Autocompletion support for `require()` has been improved. [#14409](#14409) * **Utilities** * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has been implemented as an experimental feature. [#13644](#13644) * **Added new collaborators** * [XadillaX](https://github.com/XadillaX) – Khaidi Chu
V8 6.0: The V8 engine has been upgraded to version 6.0, which has a significantly changed performance profile. [#14574](#14574) More detailed information on performance differences can be found at https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de Other notable changes: * **DNS** * Independent DNS resolver instances are supported now, with support for cancelling the corresponding requests. [#14518](#14518) * **N-API** * Multiple N-API functions for error handling have been changed to support assigning error codes. [#13988](#13988) * **REPL** * Autocompletion support for `require()` has been improved. [#14409](#14409) * **Utilities** * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has been implemented as an experimental feature. [#13644](#13644) * **Added new collaborators** * [XadillaX](https://github.com/XadillaX) – Khaidi Chu
V8 6.0: The V8 engine has been upgraded to version 6.0, which has a significantly changed performance profile. [#14574](#14574) More detailed information on performance differences can be found at https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de Other notable changes: * **DNS** * Independent DNS resolver instances are supported now, with support for cancelling the corresponding requests. [#14518](#14518) * **N-API** * Multiple N-API functions for error handling have been changed to support assigning error codes. [#13988](#13988) * **REPL** * Autocompletion support for `require()` has been improved. [#14409](#14409) * **Utilities** * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has been implemented as an experimental feature. [#13644](#13644) * **Added new collaborators** * [XadillaX](https://github.com/XadillaX) – Khaidi Chu
| 'defines': [ | ||
| # ICU cannot swap the initial data without this. | ||
| # http://bugs.icu-project.org/trac/ticket/11046 | ||
| 'UCONFIG_NO_LEGACY_CONVERSION=1' |
There was a problem hiding this comment.
FYI, Coverity is none too happy about this being re-enabled.
V8 6.0: The V8 engine has been upgraded to version 6.0, which has a significantly changed performance profile. [#14574](#14574) More detailed information on performance differences can be found at https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de Other notable changes: * **DNS** * Independent DNS resolver instances are supported now, with support for cancelling the corresponding requests. [#14518](#14518) * **N-API** * Multiple N-API functions for error handling have been changed to support assigning error codes. [#13988](#13988) * **REPL** * Autocompletion support for `require()` has been improved. [#14409](#14409) * **Utilities** * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has been implemented as an experimental feature. [#13644](#13644) * **Added new collaborators** * [XadillaX](https://github.com/XadillaX) – Khaidi Chu * [gabrielschulhof](https://github.com/gabrielschulhof) – Gabriel Schulhof
V8 6.0: The V8 engine has been upgraded to version 6.0, which has a significantly changed performance profile. [#14574](#14574) More detailed information on performance differences can be found at https://medium.com/the-node-js-collection/get-ready-a-new-v8-is-coming-node-js-performance-is-changing-46a63d6da4de Other notable changes: * **DNS** * Independent DNS resolver instances are supported now, with support for cancelling the corresponding requests. [#14518](#14518) * **N-API** * Multiple N-API functions for error handling have been changed to support assigning error codes. [#13988](#13988) * **REPL** * Autocompletion support for `require()` has been improved. [#14409](#14409) * **Utilities** * The WHATWG Encoding Standard (`TextDecoder` and `TextEncoder`) has been implemented as an experimental feature. [#13644](#13644) * **Added new collaborators** * [XadillaX](https://github.com/XadillaX) – Khaidi Chu * [gabrielschulhof](https://github.com/gabrielschulhof) – Gabriel Schulhof Conflicts: src/node_version.h
PR-URL: nodejs#13916 Refs: nodejs#13644 (comment) Reviewed-By: Vse Mozhet Byt <vsemozhetbyt@gmail.com>
PR-URL: #13916 Refs: #13644 (comment) Reviewed-By: Vse Mozhet Byt <vsemozhetbyt@gmail.com>
|
Release team decided not to land on v6.x, if you disagree let us know. |
Provide an (initially experimental) implementation of the WHATWG Encoding Standard API (
TextDecoderandTextEncoder). The is the same API implemented on the browser side.By default, with small-icu, only the UTF-8, UTF-16le and UTF-16be decoders are supported. With full-icu enabled, every encoding other than iso-8859-16 is supported.
This provides a basic test, but does not include the full web platform tests. Note: many of the web platform tests for this would fail by default because we ship with small-icu by default.
The implementation is added without changing any of the existing encoding support in core.
Refs: https://encoding.spec.whatwg.org/
/cc @domenic @TimothyGu @addaleax
Checklist
make -j4 test(UNIX), orvcbuild test(Windows) passesAffected core subsystem(s)
buffer