Commit f5511ee

minor fixes
1 parent 7c05948 commit f5511ee

1 file changed

+91
-87
lines changed

1-js/05-data-types/03-string/article.md

@@ -59,10 +59,10 @@ It is still possible to create multiline strings with single and double quotes b
5959
```js run
6060
let guestList = "Guests:\n * John\n * Pete\n * Mary";
6161

62-
alert(guestList); // a multiline list of guests
62+
alert(guestList); // a multiline list of guests, same as above
6363
```
6464

65-
For example, these two lines are equal, just written differently:
65+
As a simpler example, these two lines are equal, just written differently:
6666

6767
```js run
6868
let str1 = "Hello\nWorld"; // two lines using a "newline symbol"
@@ -74,33 +74,26 @@ World`;
7474
alert(str1 == str2); // true
7575
```
7676

77-
There are other, less common "special" characters.
78-
79-
Here's the full list:
77+
There are other, less common "special" characters:
8078

8179
| Character | Description |
8280
|-----------|-------------|
8381
|`\n`|New line|
8482
|`\r`|In Windows text files, the two-character combination `\r\n` represents a line break, while on non-Windows OS it's just `\n`. That's for historical reasons; most Windows software also understands `\n`. |
85-
|`\'`, `\"`|Quotes|
83+
|`\'`,&nbsp;`\"`,&nbsp;<code>\\`</code>|Quotes|
8684
|`\\`|Backslash|
8785
|`\t`|Tab|
88-
|`\b`, `\f`, `\v`| Backspace, Form Feed, Vertical Tab -- kept for compatibility, not used nowadays. |
89-
|`\xXX`|A character whose [Unicode](https://en.wikipedia.org/wiki/Unicode) code point is `U+00XX`. `XX` is always two hexadecimal digits with value between `00` and `FF`, so `\xXX` notation can be used only for the first 256 Unicode characters (including all 128 ASCII characters). For example, `"\x7A"` is the same as `"z"` (Unicode code point `U+007A`).|
90-
|`\uXXXX`|A character whose Unicode code point is `U+XXXX` (a character with the hex code `XXXX` in UTF-16 encoding). `XXXX` must be exactly 4 hex digits with the value between `0000` and `FFFF`, so `\uXXXX` notation can be used for the first 65536 Unicode characters. Characters with Unicode value greater than `U+FFFF` can also be represented with this notation, but in this case we will need to use a so called surrogate pair (we will talk about surrogate pairs later in this chapter). For instance, `"\u00A9"` is a copyright symbol `©` (Unicode code point `U+00A9`), but for smiling cat face 😺 we have to use a surrogate pair `"\uD83D\uDE3A"` (because its Unicode code point `U+1F63A` is greater than `U+FFFF`).|
91-
|`\u{X…XXXXXX}` (1 to 6 hex characters)|A character with any given Unicode code point (a character with the given hex code in UTF-32 encoding). `X…XXXXXX` is a hex value between `0` and `10FFFF` (the highest code point defined by Unicode). This notation was added to the language in ECMAScript 2015 (ES6) standard and allows us to easily represent all existing Unicode characters without need for surrogate pairs. Unlike previous two notations, there is no need to add leading zeros for characters with "small" code point values: `"\u{7A}"`, `"\u{007A}"` and `"\u{00007A}"` are all acceptable.|
86+
|`\b`, `\f`, `\v`| Backspace, Form Feed, Vertical Tab -- mentioned for completeness, coming from old times, not used nowadays (you can forget them right now). |
9287

93-
Examples with Unicode:
88+
As you can see, all special characters start with a backslash character `\`. It is also called an "escape character".
89+
90+
Because it's so special, if we need to show an actual backslash `\` within the string, we need to double it:
9491

9592
```js run
96-
alert( "\u00A9" ); // ©, we will get the very same result with alert( "\xA9" ) and alert( "\u{A9}" )
97-
alert( "\u{20331}" ); // 佫, a rare Chinese hieroglyph (long Unicode)
98-
alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long Unicode)
93+
alert( `The backslash: \\` ); // The backslash: \
9994
```
10095

101-
All special characters start with a backslash character `\`. It is also called an "escape character".
102-
103-
We might also use it if we wanted to insert a quote into the string.
96+
So-called "escaped" quotes `\'`, `\"`, <code>\\`</code> are used to insert a quote into the same-quoted string.
10497

10598
For instance:
10699

@@ -113,18 +106,10 @@ As you can see, we have to prepend the inner quote by the backslash `\'`, becaus
113106
Of course, only the quotes that are the same as the enclosing ones need to be escaped. So, as a more elegant solution, we could switch to double quotes or backticks instead:
114107

115108
```js run
116-
alert( `I'm the Walrus!` ); // I'm the Walrus!
109+
alert( "I'm the Walrus!" ); // I'm the Walrus!
117110
```
118111

119-
Note that the backslash `\` serves for the correct reading of the string by JavaScript, then disappears. The in-memory string has no `\`. You can clearly see that in `alert` from the examples above.
120-
121-
But what if we need to show an actual backslash `\` within the string?
122-
123-
That's possible, but we need to double it like `\\`:
124-
125-
```js run
126-
alert( `The backslash: \\` ); // The backslash: \
127-
```
112+
Besides these special characters, there's also a special notation for Unicode codes, `\u…`. We'll cover it a bit later in this chapter.
128113

129114
## String length
130115

@@ -310,45 +295,6 @@ if (str.indexOf("Widget") != -1) {
310295
}
311296
```
312297

313-
#### The bitwise NOT trick
314-
315-
One of the old tricks used here is the [bitwise NOT](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Bitwise_NOT) `~` operator. It converts the number to a 32-bit integer (removes the decimal part if exists) and then reverses all bits in its binary representation.
316-
317-
In practice, that means a simple thing: for 32-bit integers `~n` equals `-(n+1)`.
318-
319-
For instance:
320-
321-
```js run
322-
alert( ~2 ); // -3, the same as -(2+1)
323-
alert( ~1 ); // -2, the same as -(1+1)
324-
alert( ~0 ); // -1, the same as -(0+1)
325-
*!*
326-
alert( ~-1 ); // 0, the same as -(-1+1)
327-
*/!*
328-
```
329-
330-
As we can see, `~n` is zero only if `n == -1` (that's for any 32-bit signed integer `n`).
331-
332-
So, the test `if ( ~str.indexOf("...") )` is truthy only if the result of `indexOf` is not `-1`. In other words, when there is a match.
333-
334-
People use it to shorten `indexOf` checks:
335-
336-
```js run
337-
let str = "Widget";
338-
339-
if (~str.indexOf("Widget")) {
340-
alert( 'Found it!' ); // works
341-
}
342-
```
343-
344-
It is usually not recommended to use language features in a non-obvious way, but this particular trick is widely used in old code, so we should understand it.
345-
346-
Just remember: `if (~str.indexOf(...))` reads as "if found".
347-
348-
To be precise though, as big numbers are truncated to 32 bits by `~` operator, there exist other numbers that give `0`, the smallest is `~4294967295=0`. That makes such check correct only if a string is not that long.
349-
350-
Right now we can see this trick only in the old code, as modern JavaScript provides `.includes` method (see below).
351-
352298
### includes, startsWith, endsWith
353299

354300
The more modern method [str.includes(substr, pos)](mdn:js/String/includes) returns `true/false` depending on whether `str` contains `substr` within.
@@ -407,7 +353,7 @@ There are 3 methods in JavaScript to get a substring: `substring`, `substr` and
407353
```
408354

409355
`str.substring(start [, end])`
410-
: Returns the part of the string *between* `start` and `end` (not including the greater of them).
356+
: Returns the part of the string *between* `start` and `end` (not including `end`).
411357

412358
This is almost the same as `slice`, but it allows `start` to be greater than `end` (in this case it simply swaps `start` and `end` values).
413359

@@ -452,13 +398,15 @@ Let's recap these methods to avoid any confusion:
452398
| method | selects... | negatives |
453399
|--------|-----------|-----------|
454400
| `slice(start, end)` | from `start` to `end` (not including `end`) | allows negatives |
455-
| `substring(start, end)` | between `start` and `end` (not including the greater of them)| negative values mean `0` |
401+
| `substring(start, end)` | between `start` and `end` (not including `end`)| negative values mean `0` |
456402
| `substr(start, length)` | from `start` get `length` characters | allows negative `start` |
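
As a quick illustration of the table above, here is a minimal sketch that runs the three methods on one sample string. The sample string and the use of `console.log` instead of `alert` are choices made for this example only:

```js
let str = "stringify";

// all three can select the same part
console.log( str.slice(2, 6) );     // ring
console.log( str.substring(2, 6) ); // ring
console.log( str.substr(2, 4) );    // ring (start 2, take 4 characters)

// start greater than end: substring swaps them, slice returns an empty string
console.log( str.substring(6, 2) ); // ring
console.log( str.slice(6, 2) );     // "" (an empty string)

// negative values: slice and substr count from the end, substring treats them as 0
console.log( str.slice(-4, -1) );    // gif
console.log( str.substr(-4, 2) );    // gi
console.log( str.substring(-4, 2) ); // st
```

The differences only show up for swapped or negative arguments; for ordinary non-negative `start < end` the results of `slice` and `substring` are the same.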
457403

458404
```smart header="Which one to choose?"
459405
All of them can do the job. Formally, `substr` has a minor drawback: it is described not in the core JavaScript specification, but in Annex B, which covers browser-only features that exist mainly for historical reasons. So, non-browser environments may fail to support it. But in practice it works everywhere.
460406
461-
Of the other two variants, `slice` is a little bit more flexible, it allows negative arguments and shorter to write. So, it's enough to remember solely `slice` of these three methods.
407+
Of the other two variants, `slice` is a little bit more flexible: it allows negative arguments and is shorter to write.
408+
409+
So, for practical use it's enough to remember only `slice`.
462410
```
463411

464412
## Comparing strings
@@ -560,62 +508,118 @@ This method actually has two additional arguments specified in [the documentatio
560508

561509
```warn header="Advanced knowledge"
562510
The section goes deeper into string internals. This knowledge will be useful for you if you plan to deal with emoji, rare mathematical or hieroglyphic characters or other rare symbols.
511+
```
512+
513+
## Unicode characters
514+
515+
As we already mentioned, JavaScript strings are based on [Unicode](https://en.wikipedia.org/wiki/Unicode).
516+
517+
In UTF-16, the encoding used by JavaScript strings, each character is represented by 2 or 4 bytes.
518+
519+
JavaScript allows us to specify a character by its Unicode value using these three notations:
520+
521+
- `\xXX` -- a character whose Unicode code point is `U+00XX`.
522+
523+
`XX` is always two hexadecimal digits with value between `00` and `FF`, so `\xXX` notation can be used only for the first 256 Unicode characters (including all 128 ASCII characters).
524+
525+
These first 256 characters include the Latin alphabet, most basic syntax characters and some others. For example, `"\x7A"` is the same as `"z"` (Unicode `U+007A`).
526+
- `\uXXXX` -- a character whose Unicode code point is `U+XXXX` (a character with the hex code `XXXX` in UTF-16 encoding).
527+
528+
`XXXX` must be exactly 4 hex digits with the value between `0000` and `FFFF`, so `\uXXXX` notation can be used for the first 65536 Unicode characters. Characters with a Unicode value greater than `U+FFFF` can also be represented with this notation, but in this case we need to use a so-called surrogate pair (we will talk about surrogate pairs later in this chapter).
529+
- `\u{X…XXXXXX}` -- a character with any given Unicode code point (a character with the given hex code in UTF-32 encoding).
530+
531+
`X…XXXXXX` must be a hexadecimal value of 1 to 6 digits between `0` and `10FFFF` (the highest code point defined by Unicode). This notation allows us to easily represent all existing Unicode characters.
532+
533+
Examples with Unicode:
534+
535+
```js run
536+
alert( "\xA9" ); // ©, the copyright symbol
563537

564-
You can skip the section if you don't plan to support them.
538+
alert( "\u00A9" ); // ©, the same as above, using the 4-digit hex notation
539+
alert( "\u044F" ); // я, the cyrillic alphabet letter
540+
alert( "\u2191" ); // ↑, the arrow up symbol
541+
542+
alert( "\u{20331}" ); // 佫, a rare Chinese hieroglyph (long Unicode)
543+
alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long Unicode)
565544
```
566545

567546
### Surrogate pairs
568547

569548
All frequently used characters have 2-byte codes. Letters in most European languages, numbers, and even most hieroglyphs, have a 2-byte representation.
570549

571-
But 2 bytes only allow 65536 combinations and that's not enough for every possible symbol. So rare symbols are encoded with a pair of 2-byte characters called "a surrogate pair".
550+
Initially, JavaScript was based on an encoding that allowed only 2 bytes per character. But 2 bytes allow only 65536 combinations, and that's not enough for every possible Unicode symbol.
551+
552+
So rare symbols that require more than 2 bytes are encoded with a pair of 2-byte characters called "a surrogate pair".
572553

573-
The length of such symbols is `2`:
554+
As a side effect, the length of such symbols is `2`:
574555

575556
```js run
576557
alert( '𝒳'.length ); // 2, MATHEMATICAL SCRIPT CAPITAL X
577558
alert( '😂'.length ); // 2, FACE WITH TEARS OF JOY
578559
alert( '𩷶'.length ); // 2, a rare Chinese hieroglyph
579560
```
580561

581-
Note that surrogate pairs did not exist at the time when JavaScript was created, and thus are not correctly processed by the language!
562+
That's because surrogate pairs did not exist at the time when JavaScript was created, and thus are not correctly processed by the language!
582563

583-
We actually have a single symbol in each of the strings above, but the `length` shows a length of `2`.
564+
We actually have a single symbol in each of the strings above, but the `length` property shows a length of `2`.
584565

585-
`String.fromCodePoint` and `str.codePointAt` are few rare methods that deal with surrogate pairs right. They recently appeared in the language. Before them, there were only [String.fromCharCode](mdn:js/String/fromCharCode) and [str.charCodeAt](mdn:js/String/charCodeAt). These methods are actually the same as `fromCodePoint/codePointAt`, but don't work with surrogate pairs.
566+
Getting a symbol can also be tricky, because most language features treat surrogate pairs as two characters.
586567

587-
Getting a symbol can be tricky, because surrogate pairs are treated as two characters:
568+
For example, here we can see two odd characters in the output:
588569

589570
```js run
590-
alert( '𝒳'[0] ); // strange symbols...
571+
alert( '𝒳'[0] ); // shows strange symbols...
591572
alert( '𝒳'[1] ); // ...pieces of the surrogate pair
592573
```
593574

594-
Note that pieces of the surrogate pair have no meaning without each other. So the alerts in the example above actually display garbage.
575+
Pieces of a surrogate pair have no meaning without each other. So the alerts in the example above actually display garbage.
595576

596577
Technically, surrogate pairs are also detectable by their codes: if a character has the code in the interval of `0xd800..0xdbff`, then it is the first part of the surrogate pair. The next character (second part) must have the code in interval `0xdc00..0xdfff`. These intervals are reserved exclusively for surrogate pairs by the standard.
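
The range check described above can be sketched as follows. The helper names `isHighSurrogate`, `isLowSurrogate` and `combineSurrogates` are hypothetical (they are not built-in methods), and `console.log` is used instead of `alert` so the snippet also runs outside the browser. The arithmetic in `combineSurrogates` is the standard UTF-16 decoding formula:

```js
// Hypothetical helpers, not part of the language: classify a code by its interval
function isHighSurrogate(code) {
  return code >= 0xd800 && code <= 0xdbff; // first part of a pair
}

function isLowSurrogate(code) {
  return code >= 0xdc00 && code <= 0xdfff; // second part of a pair
}

// Hypothetical helper: rebuild the code point from the two halves (UTF-16 formula)
function combineSurrogates(high, low) {
  return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

let str = '𝒳';
let high = str.charCodeAt(0); // 0xd835
let low = str.charCodeAt(1);  // 0xdcb3

console.log( isHighSurrogate(high) ); // true
console.log( isLowSurrogate(low) );   // true
console.log( combineSurrogates(high, low).toString(16) ); // 1d4b3
```

In real code there is rarely a need to do this by hand; the sketch only shows what the reserved intervals mean in practice.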
597578

598-
In the case above:
579+
So the methods `String.fromCodePoint` and `str.codePointAt` were added to JavaScript to deal with surrogate pairs.
580+
581+
They are essentially the same as [String.fromCharCode](mdn:js/String/fromCharCode) and [str.charCodeAt](mdn:js/String/charCodeAt), but they treat surrogate pairs correctly.
582+
583+
One can see the difference here:
599584

600585
```js run
601-
// charCodeAt is not surrogate-pair aware, so it gives codes for parts
586+
// charCodeAt is not surrogate-pair aware, so it gives codes for the 1st part of 𝒳:
602587

603-
alert( '𝒳'.charCodeAt(0).toString(16) ); // d835, between 0xd800 and 0xdbff
604-
alert( '𝒳'.charCodeAt(1).toString(16) ); // dcb3, between 0xdc00 and 0xdfff
588+
alert( '𝒳'.charCodeAt(0).toString(16) ); // d835
589+
590+
// codePointAt is surrogate-pair aware
591+
alert( '𝒳'.codePointAt(0).toString(16) ); // 1d4b3, reads both parts of the surrogate pair
592+
```
605593

606-
// codePointAt is surrogate-pair aware, but with its own specificity
594+
That said, if we take from position 1 (and that's rather incorrect here), then they both return only the 2nd part of the pair:
607595

608-
alert( '𝒳'.codePointAt(0).toString(16) ); // 1d4b3, reads both parts of the surrogate pair and returns the correct code for the symbol 𝒳
609-
alert( '𝒳'.codePointAt(1).toString(16) ); // dcb3, returns only the code for the second part of the surrogate pair
596+
```js run
597+
alert( '𝒳'.charCodeAt(1).toString(16) ); // dcb3
598+
alert( '𝒳'.codePointAt(1).toString(16) ); // dcb3
599+
// meaningless 2nd half of the pair
610600
```
611601

612602
You will find more ways to deal with surrogate pairs later in the chapter <info:iterable>. There are probably special libraries for that too, but nothing famous enough to suggest here.
613603

604+
````warn header="Takeaway: splitting strings at an arbitrary point is dangerous"
605+
We can't just split a string at an arbitrary position, e.g. take `str.slice(0, 4)`, and expect the result to be a valid string:
606+
607+
```js run
608+
alert( 'hi 😂'.slice(0, 4) ); // hi [?]
609+
```
610+
611+
Here we can see a garbage character (first half of the smile surrogate pair) in the output.
612+
613+
Just be aware of it if you intend to reliably work with surrogate pairs. It may not be a big problem, but at least you should understand what happens.
614+
````
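
One possible way around this, shown here only as a sketch, is to split the string into whole symbols first and slice the resulting array. The helper name `sliceByCodePoints` is hypothetical; it relies on `Array.from`, which iterates a string by code points and so keeps surrogate pairs together. It does not handle characters built from several code points (such as a letter followed by diacritical marks).

```js
// Hypothetical helper: slice by code points rather than by 2-byte units
function sliceByCodePoints(str, start, end) {
  // Array.from(str) yields one array item per code point, pairs stay intact
  return Array.from(str).slice(start, end).join('');
}

let str = 'hi 😂';

console.log( str.length );                // 5, the smile takes two units
console.log( Array.from(str).length );    // 4, the smile is one item
console.log( str.slice(0, 4).length );    // 4, ends with a lone half of the pair
console.log( sliceByCodePoints(str, 0, 4) ); // hi 😂
```

The price is an extra array for every call, so this fits short strings better than long texts.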
615+
614616
### Diacritical marks and normalization
615617

616618
In many languages, there are symbols that are composed of the base character with a mark above/under it.
617619

618-
For instance, the letter `a` can be the base character for: `àáâäãåā`. Most common "composite" character have their own code in the Unicode table. But not all of them, because there are too many possible combinations.
620+
For instance, the letter `a` can be the base character for these characters: `àáâäãåā`.
621+
622+
Most common "composite" characters have their own code in the Unicode table. But not all of them, because there are too many possible combinations.
619623

620624
To support arbitrary compositions, Unicode standard allows us to use several Unicode characters: the base character followed by one or many "mark" characters that "decorate" it.
621625

@@ -671,7 +675,7 @@ If you want to learn more about normalization rules and variants -- they are des
671675
## Summary
672676

673677
- There are 3 types of quotes. Backticks allow a string to span multiple lines and embed expressions `${…}`.
674-
- Strings in JavaScript are encoded using UTF-16.
678+
- Strings in JavaScript are encoded using UTF-16, with surrogate pairs for rare characters (and these cause glitches).
675679
- We can use special characters like `\n` and insert letters by their Unicode using `\u...`.
676680
- To get a character, use: `[]`.
677681
- To get a substring, use: `slice` or `substring`.
