1-js/05-data-types/03-string/article.md
Lines changed: 91 additions & 87 deletions
Original file line number
Diff line number
Diff line change
@@ -59,10 +59,10 @@ It is still possible to create multiline strings with single and double quotes b
59
59
```js run
60
60
let guestList = "Guests:\n * John\n * Pete\n * Mary";
61
61
62
-
alert(guestList); // a multiline list of guests
62
+
alert(guestList); // a multiline list of guests, same as above
63
63
```
64
64
65
-
For example, these two lines are equal, just written differently:
65
+
As a simpler example, these two lines are equal, just written differently:
66
66
67
67
```js run
68
68
let str1 = "Hello\nWorld"; // two lines using a "newline symbol"
@@ -74,33 +74,26 @@ World`;
74
74
alert(str1 == str2); // true
75
75
```
76
76
77
-
There are other, less common "special" characters.
78
-
79
-
Here's the full list:
77
+
There are other, less common "special" characters:
80
78
81
79
| Character | Description |
82
80
|-----------|-------------|
83
81
|`\n`|New line|
84
82
|`\r`|In Windows text files, a combination of two characters `\r\n` represents a line break, while on non-Windows OS it's just `\n`. That's for historical reasons; most Windows software also understands `\n`. |
85
-
|`\'`,`\"`|Quotes|
83
+
|`\'`, `\"`, <code>\\`</code>|Quotes|
86
84
|`\\`|Backslash|
87
85
|`\t`|Tab|
88
-
|`\b`, `\f`, `\v`| Backspace, Form Feed, Vertical Tab -- kept for compatibility, not used nowadays. |
89
-
|`\xXX`|A character whose [Unicode](https://en.wikipedia.org/wiki/Unicode) code point is `U+00XX`. `XX` is always two hexadecimal digits with value between `00` and `FF`, so `\xXX` notation can be used only for the first 256 Unicode characters (including all 128 ASCII characters). For example, `"\x7A"` is the same as `"z"` (Unicode code point `U+007A`).|
|`\u{X…XXXXXX}` (1 to 6 hex characters)|A character with any given Unicode code point (a character with the given hex code in UTF-32 encoding). `X…XXXXXX` is a hex value between `0` and `10FFFF` (the highest code point defined by Unicode). This notation was added to the language in ECMAScript 2015 (ES6) standard and allows us to easily represent all existing Unicode characters without need for surrogate pairs. Unlike previous two notations, there is no need to add leading zeros for characters with "small" code point values: `"\u{7A}"`, `"\u{007A}"` and `"\u{00007A}"` are all acceptable.|
86
+
|`\b`, `\f`, `\v`| Backspace, Form Feed, Vertical Tab -- mentioned for completeness, coming from old times, not used nowadays (you can forget them right now). |
92
87
93
-
Examples with Unicode:
88
+
As you can see, all special characters start with a backslash character `\`. It is also called an "escape character".
89
+
90
+
Because it's so special, if we need to show an actual backslash `\` within the string, we need to double it:
alert( "\u{20331}" ); // 佫, a rare Chinese hieroglyph (long Unicode)
98
-
alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long Unicode)
93
+
alert( `The backslash: \\` ); // The backslash: \
99
94
```
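As a small sketch of the same point (using `console.log` instead of `alert` so that it also runs outside the browser; the variable names are only illustrative): the doubled backslash exists only in the source text, and the resulting string contains a single `\` character.

```js
// "\\" in the source becomes one backslash in the actual string
const path = "C:\\temp";

console.log(path);        // C:\temp
console.log(path.length); // 7: C, :, \, t, e, m, p

// "\n" is likewise one character (a newline), not two
const twoLines = "a\nb";
console.log(twoLines.length); // 3
```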
100
95
101
-
All special characters start with a backslash character `\`. It is also called an "escape character".
102
-
103
-
We might also use it if we wanted to insert a quote into the string.
96
+
So-called "escaped" quotes `\'`, `\"`, <code>\\`</code> are used to insert a quote into the same-quoted string.
104
97
105
98
For instance:
106
99
@@ -113,18 +106,10 @@ As you can see, we have to prepend the inner quote by the backslash `\'`, becaus
113
106
Of course, only the quotes that are the same as the enclosing ones need to be escaped. So, as a more elegant solution, we could switch to double quotes or backticks instead:
114
107
115
108
```js run
116
-
alert( `I'm the Walrus!` ); // I'm the Walrus!
109
+
alert( "I'm the Walrus!" ); // I'm the Walrus!
117
110
```
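To check that escaping and switching quotes give the very same result, here is a minimal sketch (variable names are illustrative, and `console.log` is used so that it runs outside the browser too):

```js
const escaped = 'I\'m the Walrus!';   // inner single quote escaped
const doubled = "I'm the Walrus!";    // double quotes outside, no escape needed
const backticked = `I'm the Walrus!`; // backticks outside, no escape needed

// all three are the same string
console.log(escaped === doubled);    // true
console.log(doubled === backticked); // true
```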
118
111
119
-
Note that the backslash `\` serves for the correct reading of the string by JavaScript, then disappears. The in-memory string has no `\`. You can clearly see that in `alert` from the examples above.
120
-
121
-
But what if we need to show an actual backslash `\` within the string?
122
-
123
-
That's possible, but we need to double it like `\\`:
124
-
125
-
```js run
126
-
alert( `The backslash: \\` ); // The backslash: \
127
-
```
112
+
Besides these special characters, there's also a special notation for Unicode codes `\u…`; we'll cover it a bit later in this chapter.
128
113
129
114
## String length
130
115
@@ -310,45 +295,6 @@ if (str.indexOf("Widget") != -1) {
310
295
}
311
296
```
312
297
313
-
#### The bitwise NOT trick
314
-
315
-
One of the old tricks used here is the [bitwise NOT](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Bitwise_NOT) `~` operator. It converts the number to a 32-bit integer (removes the decimal part if exists) and then reverses all bits in its binary representation.
316
-
317
-
In practice, that means a simple thing: for 32-bit integers `~n` equals `-(n+1)`.
318
-
319
-
For instance:
320
-
321
-
```js run
322
-
alert( ~2 ); // -3, the same as -(2+1)
323
-
alert( ~1 ); // -2, the same as -(1+1)
324
-
alert( ~0 ); // -1, the same as -(0+1)
325
-
*!*
326
-
alert( ~-1 ); // 0, the same as -(-1+1)
327
-
*/!*
328
-
```
329
-
330
-
As we can see, `~n` is zero only if `n == -1` (that's for any 32-bit signed integer `n`).
331
-
332
-
So, the test `if ( ~str.indexOf("...") )` is truthy only if the result of `indexOf` is not `-1`. In other words, when there is a match.
333
-
334
-
People use it to shorten `indexOf` checks:
335
-
336
-
```js run
337
-
let str = "Widget";
338
-
339
-
if (~str.indexOf("Widget")) {
340
-
alert( 'Found it!' ); // works
341
-
}
342
-
```
343
-
344
-
It is usually not recommended to use language features in a non-obvious way, but this particular trick is widely used in old code, so we should understand it.
345
-
346
-
Just remember: `if (~str.indexOf(...))` reads as "if found".
347
-
348
-
To be precise though, as big numbers are truncated to 32 bits by `~` operator, there exist other numbers that give `0`, the smallest is `~4294967295=0`. That makes such check correct only if a string is not that long.
349
-
350
-
Right now we can see this trick only in the old code, as modern JavaScript provides `.includes` method (see below).
351
-
352
298
### includes, startsWith, endsWith
353
299
354
300
The more modern method [str.includes(substr, pos)](mdn:js/String/includes) returns `true/false` depending on whether `str` contains `substr` within.
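A short sketch comparing the older `indexOf` check with `includes`, and showing the optional `pos` argument. The sample string and variable name are made up for illustration; `console.log` is used so the snippet also runs outside the browser.

```js
const title = "Widget with id";

// the older check: indexOf returns -1 when there is no match
console.log(title.indexOf("Widget") != -1); // true
// the same test with includes
console.log(title.includes("Widget"));      // true

// the optional second argument is the position to start the search from
console.log(title.includes("id", 3));       // true, "id" also occurs at index 12
console.log(title.includes("Widget", 1));   // false, "Widget" only occurs at index 0
```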
@@ -407,7 +353,7 @@ There are 3 methods in JavaScript to get a substring: `substring`, `substr` and
407
353
```
408
354
409
355
`str.substring(start [, end])`
410
-
: Returns the part of the string *between* `start` and `end` (not including the greater of them).
356
+
: Returns the part of the string *between* `start` and `end` (not including `end`).
411
357
412
358
This is almost the same as `slice`, but it allows `start` to be greater than `end` (in this case it simply swaps `start` and `end` values).
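The swapping behaviour can be sketched like this (the sample string is arbitrary, and `console.log` stands in for `alert` so the code runs outside the browser as well):

```js
const lang = "JavaScript";

// for substring, the order of the arguments doesn't matter
const forward = lang.substring(4, 10);
const swapped = lang.substring(10, 4);
console.log(forward); // Script
console.log(swapped); // Script

// slice doesn't swap: start after end gives an empty string
const sliced = lang.slice(10, 4);
console.log(sliced === ""); // true
```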
413
359
@@ -452,13 +398,15 @@ Let's recap these methods to avoid any confusion:
452
398
| method | selects... | negatives |
453
399
|--------|-----------|-----------|
454
400
|`slice(start, end)`| from `start` to `end` (not including `end`) | allows negatives |
455
-
|`substring(start, end)`| between `start` and `end` (not including the greater of them)| negative values mean `0`|
401
+
|`substring(start, end)`| between `start` and `end` (not including `end`)| negative values mean `0`|
456
402
|`substr(start, length)`| from `start` get `length` characters | allows negative `start`|
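The "negatives" column of the table can be illustrated with a quick sketch, assuming an arbitrary sample word (`console.log` is used so the snippet also runs outside the browser):

```js
const word = "JavaScript";

// slice: negative values count from the end of the string
const bySlice = word.slice(-6, -3);
console.log(bySlice); // Scr

// substring: a negative value is treated as 0
const bySubstring = word.substring(-6, 4);
console.log(bySubstring); // Java

// substr: negative start counts from the end, the second argument is a length
const bySubstr = word.substr(-6, 3);
console.log(bySubstr); // Scr
```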
457
403
458
404
```smart header="Which one to choose?"
459
405
All of them can do the job. Formally, `substr` has a minor drawback: it is described not in the core JavaScript specification, but in Annex B, which covers browser-only features that exist mainly for historical reasons. So, non-browser environments may fail to support it. But in practice it works everywhere.
460
406
461
-
Of the other two variants, `slice` is a little bit more flexible, it allows negative arguments and shorter to write. So, it's enough to remember solely `slice` of these three methods.
407
+
Of the other two variants, `slice` is a little bit more flexible: it allows negative arguments and is shorter to write.
408
+
409
+
So, for practical use it's enough to remember only `slice`.
462
410
```
463
411
464
412
## Comparing strings
@@ -560,62 +508,118 @@ This method actually has two additional arguments specified in [the documentatio
560
508
561
509
```warn header="Advanced knowledge"
562
510
The section goes deeper into string internals. This knowledge will be useful for you if you plan to deal with emoji, rare mathematical or hieroglyphic characters or other rare symbols.
511
+
```
512
+
513
+
## Unicode characters
514
+
515
+
As we already mentioned, JavaScript strings are based on [Unicode](https://en.wikipedia.org/wiki/Unicode).
516
+
517
+
In the UTF-16 encoding that JavaScript uses for strings, each character is represented by one or two 16-bit units, that is, 2 or 4 bytes.
518
+
519
+
JavaScript allows us to specify a character by its Unicode value using these three notations:
520
+
521
+
- `\xXX` -- a character whose Unicode code point is `U+00XX`.
522
+
523
+
`XX` is always two hexadecimal digits with value between `00` and `FF`, so `\xXX` notation can be used only for the first 256 Unicode characters (including all 128 ASCII characters).
524
+
525
+
These first 256 characters include the Latin alphabet, most basic syntax characters and some others. For example, `"\x7A"` is the same as `"z"` (Unicode `U+007A`).
526
+
- `\uXXXX` -- a character whose Unicode code point is `U+XXXX` (a character with the hex code `XXXX` in UTF-16 encoding).
527
+
528
+
`XXXX` must be exactly 4 hex digits with a value between `0000` and `FFFF`, so `\uXXXX` notation can be used for the first 65536 Unicode characters. Characters with a Unicode value greater than `U+FFFF` can also be represented with this notation, but in this case we need to use a so-called surrogate pair (we will talk about surrogate pairs later in this chapter).
529
+
- `\u{X…XXXXXX}` -- a character with any given Unicode code point (a character with the given hex code in UTF-32 encoding).
530
+
531
+
`X…XXXXXX` must be a hexadecimal value of 1 to 6 digits between `0` and `10FFFF` (the highest code point defined by Unicode). This notation allows us to easily represent all existing Unicode characters.
alert( "\u044F" ); // я, a letter of the Cyrillic alphabet
540
+
alert( "\u2191" ); // ↑, the arrow up symbol
541
+
542
+
alert( "\u{20331}" ); // 佫, a rare Chinese hieroglyph (long Unicode)
543
+
alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long Unicode)
565
544
```
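The three notations are only different spellings in the source code. A small sketch (variable names are illustrative, `console.log` is used so it runs outside the browser too) shows that they produce identical strings, and that a code point above `U+FFFF` can be written either directly or as two `\uXXXX` halves:

```js
// the same character "z" (U+007A) written in four ways
const zVariants = ["z", "\x7A", "\u007A", "\u{7A}"];
const allSame = zVariants.every(ch => ch === "z");
console.log(allSame); // true

// a code point above U+FFFF: written directly, or as two 16-bit halves
const viaCodePoint = "\u{1F60D}";
const viaTwoUnits = "\uD83D\uDE0D";
console.log(viaCodePoint === viaTwoUnits); // true
```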
566
545
567
546
### Surrogate pairs
568
547
569
548
All frequently used characters have 2-byte codes. Letters in most European languages, numbers, and even most hieroglyphs, have a 2-byte representation.
570
549
571
-
But 2 bytes only allow 65536 combinations and that's not enough for every possible symbol. So rare symbols are encoded with a pair of 2-byte characters called "a surrogate pair".
550
+
Initially, JavaScript was based on UTF-16 encoding that only allowed 2 bytes per character. But 2 bytes only allow 65536 combinations and that's not enough for every possible symbol of Unicode.
551
+
552
+
So rare symbols that require more than 2 bytes are encoded with a pair of 2-byte characters called "a surrogate pair".
572
553
573
-
The length of such symbols is `2`:
554
+
As a side effect, the length of such symbols is `2`:
574
555
575
556
```js run
576
557
alert( '𝒳'.length ); // 2, MATHEMATICAL SCRIPT CAPITAL X
577
558
alert( '😂'.length ); // 2, FACE WITH TEARS OF JOY
578
559
alert( '𩷶'.length ); // 2, a rare Chinese hieroglyph
579
560
```
580
561
581
-
Note that surrogate pairs did not exist at the time when JavaScript was created, and thus are not correctly processed by the language!
562
+
That's because surrogate pairs did not exist at the time when JavaScript was created, and thus are not correctly processed by the language!
582
563
583
-
We actually have a single symbol in each of the strings above, but the `length` shows a length of `2`.
564
+
We actually have a single symbol in each of the strings above, but the `length` property shows a length of `2`.
584
565
585
-
`String.fromCodePoint` and `str.codePointAt` are few rare methods that deal with surrogate pairs right. They recently appeared in the language. Before them, there were only [String.fromCharCode](mdn:js/String/fromCharCode) and [str.charCodeAt](mdn:js/String/charCodeAt). These methods are actually the same as `fromCodePoint/codePointAt`, but don't work with surrogate pairs.
566
+
Getting a symbol can also be tricky, because most language features treat surrogate pairs as two characters.
586
567
587
-
Getting a symbol can be tricky, because surrogate pairs are treated as two characters:
568
+
For example, here we can see two odd characters in the output:
588
569
589
570
```js run
590
-
alert( '𝒳'[0] ); // strange symbols...
571
+
alert( '𝒳'[0] ); // shows strange symbols...
591
572
alert( '𝒳'[1] ); // ...pieces of the surrogate pair
592
573
```
593
574
594
-
Note that pieces of the surrogate pair have no meaning without each other. So the alerts in the example above actually display garbage.
575
+
Pieces of a surrogate pair have no meaning without each other. So the alerts in the example above actually display garbage.
595
576
596
577
Technically, surrogate pairs are also detectable by their codes: if a character has the code in the interval of `0xd800..0xdbff`, then it is the first part of the surrogate pair. The next character (second part) must have the code in interval `0xdc00..0xdfff`. These intervals are reserved exclusively for surrogate pairs by the standard.
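These intervals can be turned into a simple check. The two helper functions below are hypothetical (they are not built-in methods), written only to illustrate the ranges:

```js
// hypothetical helpers, not part of the language
function isFirstHalf(code) {
  return code >= 0xd800 && code <= 0xdbff;
}

function isSecondHalf(code) {
  return code >= 0xdc00 && code <= 0xdfff;
}

const scriptX = '𝒳';
console.log(isFirstHalf(scriptX.charCodeAt(0)));  // true, the code is 0xd835
console.log(isSecondHalf(scriptX.charCodeAt(1))); // true, the code is 0xdcb3
console.log(isFirstHalf('a'.charCodeAt(0)));      // false, an ordinary character
```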
597
578
598
-
In the case above:
579
+
So the methods `String.fromCodePoint` and `str.codePointAt` were added to JavaScript to deal with surrogate pairs.
580
+
581
+
They are essentially the same as [String.fromCharCode](mdn:js/String/fromCharCode) and [str.charCodeAt](mdn:js/String/charCodeAt), but they treat surrogate pairs correctly.
582
+
583
+
One can see the difference here:
599
584
600
585
```js run
601
-
// charCodeAt is not surrogate-pair aware, so it gives codes for parts
586
+
// charCodeAt is not surrogate-pair aware, so it gives the code for the 1st part of 𝒳:
602
587
603
-
alert( '𝒳'.charCodeAt(0).toString(16) ); // d835, between 0xd800 and 0xdbff
604
-
alert( '𝒳'.charCodeAt(1).toString(16) ); // dcb3, between 0xdc00 and 0xdfff
588
+
alert( '𝒳'.charCodeAt(0).toString(16) ); // d835
589
+
590
+
// codePointAt is surrogate-pair aware
591
+
alert( '𝒳'.codePointAt(0).toString(16) ); // 1d4b3, reads both parts of the surrogate pair
592
+
```
605
593
606
-
// codePointAt is surrogate-pair aware, but with its own specificity
594
+
That said, if we take from position 1 (and that's rather incorrect here), then they both return only the 2nd part of the pair:
607
595
608
-
alert( '𝒳'.codePointAt(0).toString(16) ); // 1d4b3, reads both parts of the surrogate pair and returns the correct code for the symbol 𝒳
609
-
alert( '𝒳'.codePointAt(1).toString(16) ); // dcb3, returns only the code for the second part of the surrogate pair
596
+
```js run
597
+
alert( '𝒳'.charCodeAt(1).toString(16) ); // dcb3
598
+
alert( '𝒳'.codePointAt(1).toString(16) ); // dcb3
599
+
// meaningless 2nd half of the pair
610
600
```
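The same difference shows up in the opposite direction, when creating a character from its code. A brief sketch (variable names are illustrative, `console.log` replaces `alert` so that it runs outside the browser):

```js
// fromCodePoint accepts any code point, including those above 0xFFFF
const fromPoint = String.fromCodePoint(0x1d4b3);
console.log(fromPoint === '𝒳'); // true

// fromCharCode works with 16-bit units, so the value is cut down to 0xd4b3
const fromUnit = String.fromCharCode(0x1d4b3);
console.log(fromUnit === '𝒳'); // false

// ...but it can build the symbol from the two halves of the pair
const fromHalves = String.fromCharCode(0xd835, 0xdcb3);
console.log(fromHalves === '𝒳'); // true
```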
611
601
612
602
You will find more ways to deal with surrogate pairs later in the chapter <info:iterable>. There are probably special libraries for that too, but nothing famous enough to suggest here.
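As a preview of one such way, here is a minimal sketch relying on the fact that string iteration (`for..of`, the spread syntax) goes by whole code points, so a surrogate pair stays together. The variable names are illustrative:

```js
const greeting = 'hi 😂';

const unitCount = greeting.length;        // counts 16-bit units
const symbolCount = [...greeting].length; // counts symbols (code points)
console.log(unitCount);   // 5
console.log(symbolCount); // 4

const symbols = [];
for (const ch of greeting) {
  symbols.push(ch); // the pair for 😂 arrives as one item
}
console.log(symbols[3] === '😂'); // true
```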
613
603
604
+
````warn header="Takeaway: splitting strings at an arbitrary point is dangerous"
605
+
We can't just split a string at an arbitrary position, e.g. take `str.slice(0, 4)`, and expect it to be a valid string:
606
+
607
+
```js run
608
+
alert( 'hi 😂'.slice(0, 4) ); // hi [?]
609
+
```
610
+
611
+
Here we can see a garbage character (first half of the smile surrogate pair) in the output.
612
+
613
+
Just be aware of it if you intend to reliably work with surrogate pairs. It may not be a big problem, but at least you should understand what happens.
614
+
````
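One possible way around this, sketched with a hypothetical helper `takeSymbols` (not a built-in method): convert the string to an array of code points first, then cut the array instead of the string.

```js
// hypothetical helper: take the first `count` symbols without cutting a surrogate pair
function takeSymbols(str, count) {
  return Array.from(str).slice(0, count).join('');
}

const cutByUnits = 'hi 😂'.slice(0, 4);       // "hi " plus a broken half of the pair
const cutBySymbols = takeSymbols('hi 😂', 4); // "hi 😂"

console.log(cutByUnits === 'hi 😂');   // false
console.log(cutBySymbols === 'hi 😂'); // true
```

This protects only surrogate pairs. Characters built from several code points, such as a base letter followed by combining marks, can still be split apart.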
615
+
614
616
### Diacritical marks and normalization
615
617
616
618
In many languages, there are symbols that are composed of the base character with a mark above/under it.
617
619
618
-
For instance, the letter `a` can be the base character for: `àáâäãåā`. Most common "composite" character have their own code in the Unicode table. But not all of them, because there are too many possible combinations.
620
+
For instance, the letter `a` can be the base character for these characters: `àáâäãåā`.
621
+
622
+
Most common "composite" characters have their own code in the Unicode table. But not all of them, because there are too many possible combinations.
619
623
620
624
To support arbitrary compositions, the Unicode standard allows us to use several Unicode characters: the base character followed by one or many "mark" characters that "decorate" it.
621
625
@@ -671,7 +675,7 @@ If you want to learn more about normalization rules and variants -- they are des
671
675
## Summary
672
676
673
677
- There are 3 types of quotes. Backticks allow a string to span multiple lines and embed expressions `${…}`.
674
-
- Strings in JavaScript are encoded using UTF-16.
678
+
- Strings in JavaScript are encoded using UTF-16, with surrogate pairs for rare characters (and these cause glitches).
675
679
- We can use special characters like `\n` and insert letters by their Unicode using `\u...`.
676
680
- To get a character, use: `[]`.
677
681
- To get a substring, use: `slice` or `substring`.