Replace unicode with Unicode all over the book

This commit is contained in:
Vse Mozhe Buty 2020-12-07 18:05:47 +02:00
parent e87f130fc1
commit 7c73f64a13
6 changed files with 23 additions and 23 deletions

View file

@ -1,6 +1,6 @@
The maximal length must be `maxlength`, so we need to cut it a little shorter, to give space for the ellipsis.
Note that there is actually a single unicode character for an ellipsis. That's not three dots.
Note that there is actually a single Unicode character for an ellipsis. That's not three dots.
```js run demo
function truncate(str, maxlength) {

View file

@ -50,7 +50,7 @@ let guestList = "Guests: // Error: Unexpected token ILLEGAL
Single and double quotes come from ancient times of language creation when the need for multiline strings was not taken into account. Backticks appeared much later and thus are more versatile.
Backticks also allow us to specify a "template function" before the first backtick. The syntax is: <code>func&#96;string&#96;</code>. The function `func` is called automatically, receives the string and embedded expressions and can process them. This is called "tagged templates". This feature makes it easier to implement custom templating, but is rarely used in practice. You can read more about it in the [manual](mdn:/JavaScript/Reference/Template_literals#Tagged_templates).
Backticks also allow us to specify a "template function" before the first backtick. The syntax is: <code>func&#96;string&#96;</code>. The function `func` is called automatically, receives the string and embedded expressions and can process them. This is called "tagged templates". This feature makes it easier to implement custom templating, but is rarely used in practice. You can read more about it in the [manual](mdn:/JavaScript/Reference/Template_literals#Tagged_templates).
## Special characters
@ -86,16 +86,16 @@ Here's the full list:
|`\\`|Backslash|
|`\t`|Tab|
|`\b`, `\f`, `\v`| Backspace, Form Feed, Vertical Tab -- kept for compatibility, not used nowadays. |
|`\xXX`|Unicode character with the given hexadecimal unicode `XX`, e.g. `'\x7A'` is the same as `'z'`.|
|`\uXXXX`|A unicode symbol with the hex code `XXXX` in UTF-16 encoding, for instance `\u00A9` -- is a unicode for the copyright symbol `©`. It must be exactly 4 hex digits. |
|`\u{X…XXXXXX}` (1 to 6 hex characters)|A unicode symbol with the given UTF-32 encoding. Some rare characters are encoded with two unicode symbols, taking 4 bytes. This way we can insert long codes. |
|`\xXX`|Unicode character with the given hexadecimal Unicode `XX`, e.g. `'\x7A'` is the same as `'z'`.|
|`\uXXXX`|A Unicode symbol with the hex code `XXXX` in UTF-16 encoding, for instance `\u00A9` -- is a Unicode for the copyright symbol `©`. It must be exactly 4 hex digits. |
|`\u{X…XXXXXX}` (1 to 6 hex characters)|A Unicode symbol with the given UTF-32 encoding. Some rare characters are encoded with two Unicode symbols, taking 4 bytes. This way we can insert long codes. |
Examples with unicode:
Examples with Unicode:
```js run
alert( "\u00A9" ); // ©
alert( "\u{20331}" ); // 佫, a rare Chinese hieroglyph (long unicode)
alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long unicode)
alert( "\u{20331}" ); // 佫, a rare Chinese hieroglyph (long Unicode)
alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long Unicode)
```
All special characters start with a backslash character `\`. It is also called an "escape character".
@ -499,7 +499,7 @@ All strings are encoded using [UTF-16](https://en.wikipedia.org/wiki/UTF-16). Th
alert( String.fromCodePoint(90) ); // Z
```
We can also add unicode characters by their codes using `\u` followed by the hex code:
We can also add Unicode characters by their codes using `\u` followed by the hex code:
```js run
// 90 is 5a in hexadecimal system
@ -608,7 +608,7 @@ In many languages there are symbols that are composed of the base character with
For instance, the letter `a` can be the base character for: `àáâäãåā`. Most common "composite" character have their own code in the UTF-16 table. But not all of them, because there are too many possible combinations.
To support arbitrary compositions, UTF-16 allows us to use several unicode characters: the base character followed by one or many "mark" characters that "decorate" it.
To support arbitrary compositions, UTF-16 allows us to use several Unicode characters: the base character followed by one or many "mark" characters that "decorate" it.
For instance, if we have `S` followed by the special "dot above" character (code `\u0307`), it is shown as Ṡ.
@ -626,7 +626,7 @@ For example:
alert( 'S\u0307\u0323' ); // Ṩ
```
This provides great flexibility, but also an interesting problem: two characters may visually look the same, but be represented with different unicode compositions.
This provides great flexibility, but also an interesting problem: two characters may visually look the same, but be represented with different Unicode compositions.
For instance:
@ -639,7 +639,7 @@ alert( `s1: ${s1}, s2: ${s2}` );
alert( s1 == s2 ); // false though the characters look identical (?!)
```
To solve this, there exists a "unicode normalization" algorithm that brings each string to the single "normal" form.
To solve this, there exists a "Unicode normalization" algorithm that brings each string to the single "normal" form.
It is implemented by [str.normalize()](mdn:js/String/normalize).
@ -663,7 +663,7 @@ If you want to learn more about normalization rules and variants -- they are des
- There are 3 types of quotes. Backticks allow a string to span multiple lines and embed expressions `${…}`.
- Strings in JavaScript are encoded using UTF-16.
- We can use special characters like `\n` and insert letters by their unicode using `\u...`.
- We can use special characters like `\n` and insert letters by their Unicode using `\u...`.
- To get a character, use: `[]`.
- To get a substring, use: `slice` or `substring`.
- To lowercase/uppercase a string, use: `toLowerCase/toUpperCase`.

View file

@ -12,7 +12,7 @@ let decoder = new TextDecoder([label], [options]);
- **`label`** -- the encoding, `utf-8` by default, but `big5`, `windows-1251` and many other are also supported.
- **`options`** -- optional object:
- **`fatal`** -- boolean, if `true` then throw an exception for invalid (non-decodable) characters, otherwise (default) replace them with character `\uFFFD`.
- **`ignoreBOM`** -- boolean, if `true` then ignore BOM (an optional byte-order unicode mark), rarely needed.
- **`ignoreBOM`** -- boolean, if `true` then ignore BOM (an optional byte-order Unicode mark), rarely needed.
...And then decode:

View file

@ -56,7 +56,7 @@ There are only 6 of them in JavaScript:
: Enables "dotall" mode, that allows a dot `pattern:.` to match newline character `\n` (covered in the chapter <info:regexp-character-classes>).
`pattern:u`
: Enables full unicode support. The flag enables correct processing of surrogate pairs. More about that in the chapter <info:regexp-unicode>.
: Enables full Unicode support. The flag enables correct processing of surrogate pairs. More about that in the chapter <info:regexp-unicode>.
`pattern:y`
: "Sticky" mode: searching at the exact position in the text (covered in the chapter <info:regexp-sticky>)

View file

@ -4,9 +4,9 @@ JavaScript uses [Unicode encoding](https://en.wikipedia.org/wiki/Unicode) for st
That range is not big enough to encode all possible characters, that's why some rare characters are encoded with 4 bytes, for instance like `𝒳` (mathematical X) or `😄` (a smile), some hieroglyphs and so on.
Here are the unicode values of some characters:
Here are the Unicode values of some characters:
| Character | Unicode | Bytes count in unicode |
| Character | Unicode | Bytes count in Unicode |
|------------|---------|--------|
| a | `0x0061` | 2 |
| ≈ | `0x2248` | 2 |
@ -121,7 +121,7 @@ alert("number: xAF".match(regexp)); // xAF
Let's look for Chinese hieroglyphs.
There's a unicode property `Script` (a writing system), that may have a value: `Cyrillic`, `Greek`, `Arabic`, `Han` (Chinese) and so on, [here's the full list](https://en.wikipedia.org/wiki/Script_(Unicode)).
There's a Unicode property `Script` (a writing system), that may have a value: `Cyrillic`, `Greek`, `Arabic`, `Han` (Chinese) and so on, [here's the full list](https://en.wikipedia.org/wiki/Script_(Unicode)).
To look for characters in a given writing system we should use `pattern:Script=<value>`, e.g. for Cyrillic letters: `pattern:\p{sc=Cyrillic}`, for Chinese hieroglyphs: `pattern:\p{sc=Han}`, and so on:
@ -135,7 +135,7 @@ alert( str.match(regexp) ); // 你,好
### Example: currency
Characters that denote a currency, such as `$`, `€`, `¥`, have unicode property `pattern:\p{Currency_Symbol}`, the short alias: `pattern:\p{Sc}`.
Characters that denote a currency, such as `$`, `€`, `¥`, have Unicode property `pattern:\p{Currency_Symbol}`, the short alias: `pattern:\p{Sc}`.
Let's use it to look for prices in the format "currency, followed by a digit":

View file

@ -57,16 +57,16 @@ For instance:
- **\d** -- is the same as `pattern:[0-9]`,
- **\w** -- is the same as `pattern:[a-zA-Z0-9_]`,
- **\s** -- is the same as `pattern:[\t\n\v\f\r ]`, plus few other rare unicode space characters.
- **\s** -- is the same as `pattern:[\t\n\v\f\r ]`, plus few other rare Unicode space characters.
```
### Example: multi-language \w
As the character class `pattern:\w` is a shorthand for `pattern:[a-zA-Z0-9_]`, it can't find Chinese hieroglyphs, Cyrillic letters, etc.
We can write a more universal pattern, that looks for wordly characters in any language. That's easy with unicode properties: `pattern:[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]`.
We can write a more universal pattern, that looks for wordly characters in any language. That's easy with Unicode properties: `pattern:[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]`.
Let's decipher it. Similar to `pattern:\w`, we're making a set of our own that includes characters with following unicode properties:
Let's decipher it. Similar to `pattern:\w`, we're making a set of our own that includes characters with following Unicode properties:
- `Alphabetic` (`Alpha`) - for letters,
- `Mark` (`M`) - for accents,
@ -85,7 +85,7 @@ let str = `Hi 你好 12`;
alert( str.match(regexp) ); // H,i,你,好,1,2
```
Of course, we can edit this pattern: add unicode properties or remove them. Unicode properties are covered in more details in the article <info:regexp-unicode>.
Of course, we can edit this pattern: add Unicode properties or remove them. Unicode properties are covered in more details in the article <info:regexp-unicode>.
```warn header="Unicode properties aren't supported in Edge and Firefox"
Unicode properties `pattern:p{…}` are not yet implemented in Edge and Firefox. If we really need them, we can use library [XRegExp](http://xregexp.com/).