Commit c8b4d34

committed
move Unicode to a separate article
1 parent 4a9dc8e commit c8b4d34

3 files changed: +194 -198 lines changed

1-js/05-data-types/03-string/1-ucfirst/solution.md

Lines changed: 1 addition & 7 deletions
@@ -8,12 +8,7 @@ let newStr = str[0].toUpperCase() + str.slice(1);
88

99
There's a small problem though. If `str` is empty, then `str[0]` is `undefined`, and as `undefined` doesn't have the `toUpperCase()` method, we'll get an error.
1010

11-
There are two variants here:
12-
13-
1. Use `str.charAt(0)`, as it always returns a string (maybe empty).
14-
2. Add a test for an empty string.
15-
16-
Here's the 2nd variant:
11+
The easiest way out is to add a test for an empty string, like this:
1712

1813
```js run demo
1914
function ucFirst(str) {
@@ -24,4 +19,3 @@ function ucFirst(str) {
2419

2520
alert( ucFirst("john") ); // John
2621
```
27-
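
For reference, the empty-string test described in the revised solution text could look roughly like the sketch below. The hunk above elides the function body, so the body here is an assumption rather than the committed code, and `console.log` stands in for `alert` so the snippet also runs outside a browser.

```js
// Sketch only: the function body is elided in the diff, so this body is assumed.
function ucFirst(str) {
  // Guard first: for "", str[0] is undefined, and undefined has no toUpperCase().
  if (!str) return str;

  return str[0].toUpperCase() + str.slice(1);
}

console.log( ucFirst("john") ); // John
console.log( ucFirst("") );     // (empty string, no error)
```

The guard returns the input unchanged, so an empty string stays an empty string instead of throwing.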

1-js/05-data-types/03-string/article.md

Lines changed: 21 additions & 191 deletions
@@ -50,7 +50,7 @@ let guestList = "Guests: // Error: Unexpected token ILLEGAL
5050

5151
Single and double quotes come from ancient times of language creation, when the need for multiline strings was not taken into account. Backticks appeared much later and thus are more versatile.
5252

53-
Backticks also allow us to specify a "template function" before the first backtick. The syntax is: <code>func&#96;string&#96;</code>. The function `func` is called automatically, receives the string and embedded expressions and can process them. This is called "tagged templates". This feature makes it easier to implement custom templating, but is rarely used in practice. You can read more about it in the [manual](mdn:/JavaScript/Reference/Template_literals#Tagged_templates).
53+
Backticks also allow us to specify a "template function" before the first backtick. The syntax is: <code>func&#96;string&#96;</code>. The function `func` is called automatically, receives the string and embedded expressions and can process them. This feature is called "tagged templates", it's rarely seen, but you can read about it in the MDN: [Template literals](mdn:/JavaScript/Reference/Template_literals#Tagged_templates).
5454

5555
## Special characters
5656

@@ -74,7 +74,7 @@ World`;
7474
alert(str1 == str2); // true
7575
```
7676

77-
There are other, less common "special" characters:
77+
There are other, less common special characters:
7878

7979
| Character | Description |
8080
|-----------|-------------|
@@ -109,7 +109,7 @@ Of course, only the quotes that are the same as the enclosing ones need to be es
109109
alert( "I'm the Walrus!" ); // I'm the Walrus!
110110
```
111111

112-
Besides these special characters, there's also a special notation for Unicode codes `\u…`, we'll cover it a bit later in this chapter.
112+
Besides these special characters, there's also a special notation for Unicode codes `\u…`, it's rarely used and is covered in the optional chapter about [Unicode](info:unicode).
113113

114114
## String length
115115

@@ -124,33 +124,36 @@ Note that `\n` is a single "special" character, so the length is indeed `3`.
124124
```warn header="`length` is a property"
125125
People with a background in some other languages sometimes mistype by calling `str.length()` instead of just `str.length`. That doesn't work.
126126

127-
Please note that `str.length` is a numeric property, not a function. There is no need to add parenthesis after it.
127+
Please note that `str.length` is a numeric property, not a function. There is no need to add parenthesis after it. Not `.length()`, but `.length`.
128128
```
129129
130130
## Accessing characters
131131
132-
To get a character at position `pos`, use square brackets `[pos]` or call the method [str.charAt(pos)](mdn:js/String/charAt). The first character starts from the zero position:
132+
To get a character at position `pos`, use square brackets `[pos]` or call the method [str.at(pos)](mdn:js/String/at). The first character starts from the zero position:
133133
134134
```js run
135135
let str = `Hello`;
136136
137137
// the first character
138138
alert( str[0] ); // H
139-
alert( str.charAt(0) ); // H
139+
alert( str.at(0) ); // H
140140
141141
// the last character
142142
alert( str[str.length - 1] ); // o
143+
alert( str.at(-1) );
143144
```
144145

145-
The square brackets are a modern way of getting a character, while `charAt` exists mostly for historical reasons.
146+
As you can see, the `.at(pos)` method has a benefit of allowing negative position. If `pos` is negative, then it's counted from the end of the string.
146147

147-
The only difference between them is that if no character is found, `[]` returns `undefined`, and `charAt` returns an empty string:
148+
So `.at(-1)` means the last character, and `.at(-2)` is the one before it, etc.
149+
150+
The square brackets always return `undefined` for negative indexes, for instance:
148151

149152
```js run
150153
let str = `Hello`;
151154

152-
alert( str[1000] ); // undefined
153-
alert( str.charAt(1000) ); // '' (an empty string)
155+
alert( str[-2] ); // undefined
156+
alert( str.at(-2) ); // l
154157
```
155158

156159
We can also iterate over characters using `for..of`:
@@ -429,9 +432,9 @@ Although, there are some oddities.
429432

430433
This may lead to strange results if we sort these country names. Usually people would expect `Zealand` to come after `Österreich` in the list.
431434

432-
To understand what happens, let's review the internal representation of strings in JavaScript.
435+
To understand what happens, we should be aware that strings in Javascript are encoded using [UTF-16](https://en.wikipedia.org/wiki/UTF-16). That is: each character has a corresponding numeric code.
433436

434-
All strings are encoded using [UTF-16](https://en.wikipedia.org/wiki/UTF-16). That is: each character has a corresponding numeric code. There are special methods that allow to get the character for the code and back.
437+
There are special methods that allow to get the character for the code and back:
435438

436439
`str.codePointAt(pos)`
437440
: Returns a decimal number representing the code for the character at position `pos`:
@@ -440,7 +443,7 @@ All strings are encoded using [UTF-16](https://en.wikipedia.org/wiki/UTF-16). Th
440443
// different case letters have different codes
441444
alert( "Z".codePointAt(0) ); // 90
442445
alert( "z".codePointAt(0) ); // 122
443-
alert( "z".codePointAt(0).toString(16) ); // 7a (if we need a more commonly used hex value of the code)
446+
alert( "z".codePointAt(0).toString(16) ); // 7a (if we need a hexadecimal value)
444447
```
445448

446449
`String.fromCodePoint(code)`
@@ -451,13 +454,6 @@ All strings are encoded using [UTF-16](https://en.wikipedia.org/wiki/UTF-16). Th
451454
alert( String.fromCodePoint(0x5a) ); // Z (we can also use a hex value as an argument)
452455
```
453456

454-
We can also add Unicode characters by their codes using `\u` followed by the hex code:
455-
456-
```js run
457-
// 90 is 5a in hexadecimal system
458-
alert( '\u005a' ); // Z
459-
```
460-
461457
Now let's see the characters with codes `65..220` (the latin alphabet and a little bit extra) by making a string of them:
462458
463459
```js run
@@ -467,6 +463,7 @@ for (let i = 65; i <= 220; i++) {
467463
str += String.fromCodePoint(i);
468464
}
469465
alert( str );
466+
// Output:
470467
// ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„
471468
// ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜ
472469
```
@@ -486,7 +483,7 @@ The "right" algorithm to do string comparisons is more complex than it may seem,
486483
487484
So, the browser needs to know the language to compare.
488485
489-
Luckily, all modern browsers (IE10- requires the additional library [Intl.js](https://github.com/andyearnshaw/Intl.js/)) support the internationalization standard [ECMA-402](https://www.ecma-international.org/publications-and-standards/standards/ecma-402/).
486+
Luckily, modern browsers support the internationalization standard [ECMA-402](https://www.ecma-international.org/publications-and-standards/standards/ecma-402/).
490487
491488
It provides a special method to compare strings in different languages, following their rules.
492489
@@ -504,179 +501,10 @@ alert( 'Österreich'.localeCompare('Zealand') ); // -1
504501
505502
This method actually has two additional arguments specified in [the documentation](mdn:js/String/localeCompare), which allows it to specify the language (by default taken from the environment, letter order depends on the language) and setup additional rules like case sensitivity or should `"a"` and `"á"` be treated as the same etc.
506503
507-
## Internals, Unicode
508-
509-
```warn header="Advanced knowledge"
510-
The section goes deeper into string internals. This knowledge will be useful for you if you plan to deal with emoji, rare mathematical or hieroglyphic characters or other rare symbols.
511-
```
512-
513-
## Unicode characters
514-
515-
As we already mentioned, JavaScript strings are based on [Unicode](https://en.wikipedia.org/wiki/Unicode).
516-
517-
Each character is represented by a byte sequence of 1-4 bytes.
518-
519-
JavaScript allows us to specify a character not only by directly including it into a string, but also by its hexadecimal Unicode code using these three notations:
520-
521-
- `\xXX` -- a character whose Unicode code point is `U+00XX`.
522-
523-
`XX` is two hexadecimal digits with value between `00` and `FF`, so `\xXX` notation can be used only for the first 256 Unicode characters (including all 128 ASCII characters).
524-
525-
These first 256 characters include latin alphabet, most basic syntax characters and some others. For example, `"\x7A"` is the same as `"z"` (Unicode `U+007A`).
526-
- `\uXXXX` -- a character whose Unicode code point is `U+XXXX` (a character with the hex code `XXXX` in UTF-16 encoding).
527-
528-
`XXXX` must be exactly 4 hex digits with the value between `0000` and `FFFF`, so `\uXXXX` notation can be used for the first 65536 Unicode characters. Characters with Unicode value greater than `U+FFFF` can also be represented with this notation, but in this case we will need to use a so called surrogate pair (we will talk about surrogate pairs later in this chapter).
529-
- `\u{X…XXXXXX}` -- a character with any given Unicode code point (a character with the given hex code in UTF-32 encoding).
530-
531-
`X…XXXXXX` must be a hexadecimal value of 1 to 6 bytes between `0` and `10FFFF` (the highest code point defined by Unicode). This notation allows us to easily represent all existing Unicode characters.
532-
533-
Examples with Unicode:
534-
535-
```js run
536-
alert( "\uA9" ); // ©, the copyright symbol
537-
538-
alert( "\u00A9" ); // ©, the same as above, using the 4-digit hex notation
539-
alert( "\u044F" ); // я, the cyrillic alphabet letter
540-
alert( "\u2191" ); // ↑, the arrow up symbol
541-
542-
alert( "\u{20331}" ); // 佫, a rare Chinese hieroglyph (long Unicode)
543-
alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long Unicode)
544-
```
545-
546-
### Surrogate pairs
547-
548-
All frequently used characters have 2-byte codes. Letters in most european languages, numbers, and even most hieroglyphs, have a 2-byte representation.
549-
550-
Initially, JavaScript was based on UTF-16 encoding that only allowed 2 bytes per character. But 2 bytes only allow 65536 combinations and that's not enough for every possible symbol of Unicode.
551-
552-
So rare symbols that require more than 2 bytes are encoded with a pair of 2-byte characters called "a surrogate pair".
553-
554-
As a side effect, the length of such symbols is `2`:
555-
556-
```js run
557-
alert( '𝒳'.length ); // 2, MATHEMATICAL SCRIPT CAPITAL X
558-
alert( '😂'.length ); // 2, FACE WITH TEARS OF JOY
559-
alert( '𩷶'.length ); // 2, a rare Chinese hieroglyph
560-
```
561-
562-
That's because surrogate pairs did not exist at the time when JavaScript was created, and thus are not correctly processed by the language!
563-
564-
We actually have a single symbol in each of the strings above, but the `length` property shows a length of `2`.
565-
566-
Getting a symbol can also be tricky, because most language features treat surrogate pairs as two characters.
567-
568-
For example, here we can see two odd characters in the output:
569-
570-
```js run
571-
alert( '𝒳'[0] ); // shows strange symbols...
572-
alert( '𝒳'[1] ); // ...pieces of the surrogate pair
573-
```
574-
575-
Pieces of a surrogate pair have no meaning without each other. So the alerts in the example above actually display garbage.
576-
577-
Technically, surrogate pairs are also detectable by their codes: if a character has the code in the interval of `0xd800..0xdbff`, then it is the first part of the surrogate pair. The next character (second part) must have the code in interval `0xdc00..0xdfff`. These intervals are reserved exclusively for surrogate pairs by the standard.
578-
579-
So the methods `String.fromCodePoint` and `str.codePointAt` were added in JavaScript to deal with surrogate pairs.
580-
581-
They are essentially the same as [String.fromCharCode](mdn:js/String/fromCharCode) and [str.charCodeAt](mdn:js/String/charCodeAt), but they treat surrogate pairs correctly.
582-
583-
One can see the difference here:
584-
585-
```js run
586-
// charCodeAt is not surrogate-pair aware, so it gives codes for the 1st part of 𝒳:
587-
588-
alert( '𝒳'.charCodeAt(0).toString(16) ); // d835
589-
590-
// codePointAt is surrogate-pair aware
591-
alert( '𝒳'.codePointAt(0).toString(16) ); // 1d4b3, reads both parts of the surrogate pair
592-
```
593-
594-
That said, if we take from position 1 (and that's rather incorrect here), then they both return only the 2nd part of the pair:
595-
596-
```js run
597-
alert( '𝒳'.charCodeAt(1).toString(16) ); // dcb3
598-
alert( '𝒳'.codePointAt(1).toString(16) ); // dcb3
599-
// meaningless 2nd half of the pair
600-
```
601-
602-
You will find more ways to deal with surrogate pairs later in the chapter <info:iterable>. There are probably special libraries for that too, but nothing famous enough to suggest here.
603-
604-
````warn header="Takeaway: splitting strings at an arbitrary point is dangerous"
605-
We can't just split a string at an arbitrary position, e.g. take `str.slice(0, 4)` and expect it to be a valid string, e.g.:
606-
607-
```js run
608-
alert( 'hi 😂'.slice(0, 4) ); // hi [?]
609-
```
610-
611-
Here we can see a garbage character (first half of the smile surrogate pair) in the output.
612-
613-
Just be aware of it if you intend to reliably work with surrogate pairs. May not be a big problem, but at least you should understand what happens.
614-
````
615-
616-
### Diacritical marks and normalization
617-
618-
In many languages, there are symbols that are composed of the base character with a mark above/under it.
619-
620-
For instance, the letter `a` can be the base character for these characters: `àáâäãåā`.
621-
622-
Most common "composite" characters have their own code in the Unicode table. But not all of them, because there are too many possible combinations.
623-
624-
To support arbitrary compositions, Unicode standard allows us to use several Unicode characters: the base character followed by one or many "mark" characters that "decorate" it.
625-
626-
For instance, if we have `S` followed by the special "dot above" character (code `\u0307`), it is shown as Ṡ.
627-
628-
```js run
629-
alert( 'S\u0307' ); // Ṡ
630-
```
631-
632-
If we need an additional mark above the letter (or below it) -- no problem, just add the necessary mark character.
633-
634-
For instance, if we append a character "dot below" (code `\u0323`), then we'll have "S with dots above and below": `Ṩ`.
635-
636-
For example:
637-
638-
```js run
639-
alert( 'S\u0307\u0323' ); // Ṩ
640-
```
641-
642-
This provides great flexibility, but also an interesting problem: two characters may visually look the same, but be represented with different Unicode compositions.
643-
644-
For instance:
645-
646-
```js run
647-
let s1 = 'S\u0307\u0323'; // Ṩ, S + dot above + dot below
648-
let s2 = 'S\u0323\u0307'; // Ṩ, S + dot below + dot above
649-
650-
alert( `s1: ${s1}, s2: ${s2}` );
651-
652-
alert( s1 == s2 ); // false though the characters look identical (?!)
653-
```
654-
655-
To solve this, there exists a "Unicode normalization" algorithm that brings each string to the single "normal" form.
656-
657-
It is implemented by [str.normalize()](mdn:js/String/normalize).
658-
659-
```js run
660-
alert( "S\u0307\u0323".normalize() == "S\u0323\u0307".normalize() ); // true
661-
```
662-
663-
It's funny that in our situation `normalize()` actually brings together a sequence of 3 characters to one: `\u1e68` (S with two dots).
664-
665-
```js run
666-
alert( "S\u0307\u0323".normalize().length ); // 1
667-
668-
alert( "S\u0307\u0323".normalize() == "\u1e68" ); // true
669-
```
670-
671-
In reality, this is not always the case. The reason being that the symbol `Ṩ` is "common enough", so Unicode creators included it in the main table and gave it the code.
672-
673-
If you want to learn more about normalization rules and variants -- they are described in the appendix of the Unicode standard: [Unicode Normalization Forms](https://www.unicode.org/reports/tr15/), but for most practical purposes the information from this section is enough.
674-
675504
## Summary
676505
677506
- There are 3 types of quotes. Backticks allow a string to span multiple lines and embed expressions `${…}`.
678-
- Strings in JavaScript are encoded using UTF-16, with surrogate pairs for rare characters (and these cause glitches).
679-
- We can use special characters like `\n` and insert letters by their Unicode using `\u...`.
507+
- We can use special characters, such as a line break `\n`.
680508
- To get a character, use: `[]`.
681509
- To get a substring, use: `slice` or `substring`.
682510
- To lowercase/uppercase a string, use: `toLowerCase/toUpperCase`.
@@ -690,3 +518,5 @@ There are several other helpful methods in strings:
690518
- ...and more to be found in the [manual](mdn:js/String).
691519
692520
Strings also have methods for doing search/replace with regular expressions. But that's big topic, so it's explained in a separate tutorial section <info:regular-expressions>.
521+
522+
Also, as of now it's important to know that strings are based on Unicode encoding, and hence there're issues with comparisons. There's more about Unicode in the chapter <info:unicode>.
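
The article diff above swaps `str.charAt(pos)` for `str.at(pos)` in the "Accessing characters" section. As a quick side-by-side, the sketch below compares the three ways of reading a character for in-range, negative, and out-of-range positions. It is an illustration and not part of the commit, and it uses `console.log` instead of `alert` so that it runs outside a browser.

```js
const str = "Hello";

// In-range positions: all three forms agree.
console.log( str[0], str.charAt(0), str.at(0) ); // H H H

// Negative positions: only .at() counts from the end of the string.
console.log( str[-2] );        // undefined
console.log( str.charAt(-2) ); // '' (an empty string)
console.log( str.at(-2) );     // l

// Positions past the end: [] and .at() give undefined, charAt gives ''.
console.log( str[1000] );        // undefined
console.log( str.charAt(1000) ); // '' (an empty string)
console.log( str.at(1000) );     // undefined
```

So the `charAt` behavior that the removed paragraph described (an empty string when no character is found) still holds. It is just no longer covered in the article text.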
