string_decoder: fix handling of malformed utf8 #7318
Closed
gagern wants to merge 1 commit into nodejs:master
Conversation
Force-pushed from dc94308 to d11e41c
There have been problems with utf8 decoding in cases where the input was invalid. Some cases would give different results depending on chunking, while others even led to exceptions. This commit simplifies the code in several ways, reducing the risk of breakage.

Most importantly, the `text` method is not only used for bulk conversion of a single chunk, but also for the conversion of the mini buffer `lastChar` while handling characters spanning chunks. That should make most of the problems due to incorrect handling of special cases disappear.

Secondly, that `text` method is now independent of encoding. The encoding-dependent method `complete` now has a single well-defined task: determine the buffer position up to which the input consists of complete characters. The actual conversion is handled in a central location, leading to a cleaner and leaner internal interface.

Thirdly, we no longer try to remember just how many bytes we'll need for the next complete character. We simply try to fill up the `nextChar` buffer and perform a conversion on that. This reduces the number of internal counter variables from two to one, namely `partial`, which indicates the number of bytes currently stored in `nextChar`.

A possible drawback of this approach is that there is a chance of degraded performance if input arrives one byte at a time and is from a script using long utf8 sequences. As a benefit, though, this allows us to emit a U+FFFD replacement character sooner in cases where the first byte of an utf8 sequence is not followed by the expected number of continuation bytes.

Fixes: nodejs#7308
Force-pushed from d11e41c to 588864d
gagern (Contributor, Author):
I have to concede that my changes appear to come at some performance cost, particularly for the base64 encodings. Comparing 588864d with its direct parent 1a1ff77, I see this benchmark comparison: I guess the degraded base64 performance might be due to extra method invocation from
Checklist
- `make -j4 test` (UNIX) or `vcbuild test nosign` (Windows) passes

Affected core subsystem(s)
Description of change
There have been problems with utf8 decoding in cases where the input was invalid. Some cases would give different results depending on chunking, while others even led to exceptions. This commit simplifies the code in several ways, reducing the risk of breakage.
Most importantly, the `text` method is not only used for bulk conversion of a single chunk, but also for the conversion of the mini buffer `lastChar` while handling characters spanning chunks. That should make most of the problems due to incorrect handling of special cases disappear.

Secondly, that `text` method is now independent of encoding. The encoding-dependent method `complete` now has a single well-defined task: determine the buffer position up to which the input consists of complete characters. The actual conversion is handled in a central location, leading to a cleaner and leaner internal interface.

Thirdly, we no longer try to remember just how many bytes we'll need for the next complete character. We simply try to fill up the `nextChar` buffer and perform a conversion on that. This reduces the number of internal counter variables from two to one, namely `partial`, which indicates the number of bytes currently stored in `nextChar`.

A possible drawback of this approach is that there is a chance of degraded performance if input arrives one byte at a time and is from a script using long utf8 sequences. As a benefit, though, this allows us to emit a U+FFFD replacement character sooner in cases where the first byte of an utf8 sequence is not followed by the expected number of continuation bytes.
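The task described for `complete` can be sketched as a standalone function (hypothetical illustration, not the PR's actual code: it returns the byte offset up to which a buffer contains only complete utf8 characters, by scanning at most the last three bytes for an unfinished multi-byte sequence):

```javascript
// Return the number of leading bytes in buf that form complete utf8
// characters. Only the final 1-3 bytes can belong to an unfinished
// sequence, so a backwards scan of at most 3 bytes suffices.
function completeUtf8(buf) {
  for (let i = 1; i <= 3 && i <= buf.length; i++) {
    const byte = buf[buf.length - i];
    if (byte >= 0xc0) { // lead byte of a multi-byte sequence
      const need = byte >= 0xf0 ? 4 : byte >= 0xe0 ? 3 : 2;
      return need > i ? buf.length - i // sequence still incomplete: cut here
                      : buf.length;    // sequence fully present
    }
    if (byte < 0x80) return buf.length; // ASCII byte: nothing pending
    // 0x80-0xBF is a continuation byte: keep scanning backwards.
  }
  // No lead byte found (malformed input): treat everything as complete and
  // let the central conversion step emit U+FFFD where needed.
  return buf.length;
}
```

For example, `completeUtf8` on `0x61 0xE2 0x82` (an ASCII `a` followed by two of the three bytes of `€`) would report offset 1, leaving the unfinished sequence to be carried over into the next chunk.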
Fixes: #7308
This is an alternative to #7310; merging this makes that obsolete.