Replace unicode with Unicode all over the book

This commit is contained in:
Vse Mozhe Buty 2020-12-07 18:05:47 +02:00
parent e87f130fc1
commit 7c73f64a13
6 changed files with 23 additions and 23 deletions

View file

@ -1,6 +1,6 @@
The maximal length must be `maxlength`, so we need to cut it a little shorter, to give space for the ellipsis.
Note that there is actually a single unicode character for an ellipsis. That's not three dots.
Note that there is actually a single Unicode character for an ellipsis. That's not three dots.
```js run demo
function truncate(str, maxlength) {

View file

@ -86,16 +86,16 @@ Here's the full list:
|`\\`|Backslash|
|`\t`|Tab|
|`\b`, `\f`, `\v`| Backspace, Form Feed, Vertical Tab -- kept for compatibility, not used nowadays. |
|`\xXX`|Unicode character with the given hexadecimal unicode `XX`, e.g. `'\x7A'` is the same as `'z'`.|
|`\uXXXX`|A unicode symbol with the hex code `XXXX` in UTF-16 encoding, for instance `\u00A9` -- is a unicode for the copyright symbol `©`. It must be exactly 4 hex digits. |
|`\u{X…XXXXXX}` (1 to 6 hex characters)|A unicode symbol with the given UTF-32 encoding. Some rare characters are encoded with two unicode symbols, taking 4 bytes. This way we can insert long codes. |
|`\xXX`|Unicode character with the given hexadecimal Unicode `XX`, e.g. `'\x7A'` is the same as `'z'`.|
|`\uXXXX`|A Unicode symbol with the hex code `XXXX` in UTF-16 encoding, for instance `\u00A9` -- is a Unicode for the copyright symbol `©`. It must be exactly 4 hex digits. |
|`\u{X…XXXXXX}` (1 to 6 hex characters)|A Unicode symbol with the given UTF-32 encoding. Some rare characters are encoded with two Unicode symbols, taking 4 bytes. This way we can insert long codes. |
Examples with unicode:
Examples with Unicode:
```js run
alert( "\u00A9" ); // ©
alert( "\u{20331}" ); // 佫, a rare Chinese hieroglyph (long unicode)
alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long unicode)
alert( "\u{20331}" ); // 佫, a rare Chinese hieroglyph (long Unicode)
alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long Unicode)
```
All special characters start with a backslash character `\`. It is also called an "escape character".
@ -499,7 +499,7 @@ All strings are encoded using [UTF-16](https://en.wikipedia.org/wiki/UTF-16). Th
alert( String.fromCodePoint(90) ); // Z
```
We can also add unicode characters by their codes using `\u` followed by the hex code:
We can also add Unicode characters by their codes using `\u` followed by the hex code:
```js run
// 90 is 5a in hexadecimal system
@ -608,7 +608,7 @@ In many languages there are symbols that are composed of the base character with
For instance, the letter `a` can be the base character for: `àáâäãåā`. Most common "composite" character have their own code in the UTF-16 table. But not all of them, because there are too many possible combinations.
To support arbitrary compositions, UTF-16 allows us to use several unicode characters: the base character followed by one or many "mark" characters that "decorate" it.
To support arbitrary compositions, UTF-16 allows us to use several Unicode characters: the base character followed by one or many "mark" characters that "decorate" it.
For instance, if we have `S` followed by the special "dot above" character (code `\u0307`), it is shown as Ṡ.
@ -626,7 +626,7 @@ For example:
alert( 'S\u0307\u0323' ); // Ṩ
```
This provides great flexibility, but also an interesting problem: two characters may visually look the same, but be represented with different unicode compositions.
This provides great flexibility, but also an interesting problem: two characters may visually look the same, but be represented with different Unicode compositions.
For instance:
@ -639,7 +639,7 @@ alert( `s1: ${s1}, s2: ${s2}` );
alert( s1 == s2 ); // false though the characters look identical (?!)
```
To solve this, there exists a "unicode normalization" algorithm that brings each string to the single "normal" form.
To solve this, there exists a "Unicode normalization" algorithm that brings each string to the single "normal" form.
It is implemented by [str.normalize()](mdn:js/String/normalize).
@ -663,7 +663,7 @@ If you want to learn more about normalization rules and variants -- they are des
- There are 3 types of quotes. Backticks allow a string to span multiple lines and embed expressions `${…}`.
- Strings in JavaScript are encoded using UTF-16.
- We can use special characters like `\n` and insert letters by their unicode using `\u...`.
- We can use special characters like `\n` and insert letters by their Unicode using `\u...`.
- To get a character, use: `[]`.
- To get a substring, use: `slice` or `substring`.
- To lowercase/uppercase a string, use: `toLowerCase/toUpperCase`.

View file

@ -12,7 +12,7 @@ let decoder = new TextDecoder([label], [options]);
- **`label`** -- the encoding, `utf-8` by default, but `big5`, `windows-1251` and many other are also supported.
- **`options`** -- optional object:
- **`fatal`** -- boolean, if `true` then throw an exception for invalid (non-decodable) characters, otherwise (default) replace them with character `\uFFFD`.
- **`ignoreBOM`** -- boolean, if `true` then ignore BOM (an optional byte-order unicode mark), rarely needed.
- **`ignoreBOM`** -- boolean, if `true` then ignore BOM (an optional byte-order Unicode mark), rarely needed.
...And then decode:

View file

@ -56,7 +56,7 @@ There are only 6 of them in JavaScript:
: Enables "dotall" mode, that allows a dot `pattern:.` to match newline character `\n` (covered in the chapter <info:regexp-character-classes>).
`pattern:u`
: Enables full unicode support. The flag enables correct processing of surrogate pairs. More about that in the chapter <info:regexp-unicode>.
: Enables full Unicode support. The flag enables correct processing of surrogate pairs. More about that in the chapter <info:regexp-unicode>.
`pattern:y`
: "Sticky" mode: searching at the exact position in the text (covered in the chapter <info:regexp-sticky>)

View file

@ -4,9 +4,9 @@ JavaScript uses [Unicode encoding](https://en.wikipedia.org/wiki/Unicode) for st
That range is not big enough to encode all possible characters, that's why some rare characters are encoded with 4 bytes, for instance like `𝒳` (mathematical X) or `😄` (a smile), some hieroglyphs and so on.
Here are the unicode values of some characters:
Here are the Unicode values of some characters:
| Character | Unicode | Bytes count in unicode |
| Character | Unicode | Bytes count in Unicode |
|------------|---------|--------|
| a | `0x0061` | 2 |
| ≈ | `0x2248` | 2 |
@ -121,7 +121,7 @@ alert("number: xAF".match(regexp)); // xAF
Let's look for Chinese hieroglyphs.
There's a unicode property `Script` (a writing system), that may have a value: `Cyrillic`, `Greek`, `Arabic`, `Han` (Chinese) and so on, [here's the full list](https://en.wikipedia.org/wiki/Script_(Unicode)).
There's a Unicode property `Script` (a writing system), that may have a value: `Cyrillic`, `Greek`, `Arabic`, `Han` (Chinese) and so on, [here's the full list](https://en.wikipedia.org/wiki/Script_(Unicode)).
To look for characters in a given writing system we should use `pattern:Script=<value>`, e.g. for Cyrillic letters: `pattern:\p{sc=Cyrillic}`, for Chinese hieroglyphs: `pattern:\p{sc=Han}`, and so on:
@ -135,7 +135,7 @@ alert( str.match(regexp) ); // 你,好
### Example: currency
Characters that denote a currency, such as `$`, `€`, `¥`, have unicode property `pattern:\p{Currency_Symbol}`, the short alias: `pattern:\p{Sc}`.
Characters that denote a currency, such as `$`, `€`, `¥`, have Unicode property `pattern:\p{Currency_Symbol}`, the short alias: `pattern:\p{Sc}`.
Let's use it to look for prices in the format "currency, followed by a digit":

View file

@ -57,16 +57,16 @@ For instance:
- **\d** -- is the same as `pattern:[0-9]`,
- **\w** -- is the same as `pattern:[a-zA-Z0-9_]`,
- **\s** -- is the same as `pattern:[\t\n\v\f\r ]`, plus few other rare unicode space characters.
- **\s** -- is the same as `pattern:[\t\n\v\f\r ]`, plus few other rare Unicode space characters.
```
### Example: multi-language \w
As the character class `pattern:\w` is a shorthand for `pattern:[a-zA-Z0-9_]`, it can't find Chinese hieroglyphs, Cyrillic letters, etc.
We can write a more universal pattern, that looks for wordly characters in any language. That's easy with unicode properties: `pattern:[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]`.
We can write a more universal pattern, that looks for wordly characters in any language. That's easy with Unicode properties: `pattern:[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]`.
Let's decipher it. Similar to `pattern:\w`, we're making a set of our own that includes characters with following unicode properties:
Let's decipher it. Similar to `pattern:\w`, we're making a set of our own that includes characters with following Unicode properties:
- `Alphabetic` (`Alpha`) - for letters,
- `Mark` (`M`) - for accents,
@ -85,7 +85,7 @@ let str = `Hi 你好 12`;
alert( str.match(regexp) ); // H,i,你,好,1,2
```
Of course, we can edit this pattern: add unicode properties or remove them. Unicode properties are covered in more details in the article <info:regexp-unicode>.
Of course, we can edit this pattern: add Unicode properties or remove them. Unicode properties are covered in more details in the article <info:regexp-unicode>.
```warn header="Unicode properties aren't supported in Edge and Firefox"
Unicode properties `pattern:p{…}` are not yet implemented in Edge and Firefox. If we really need them, we can use library [XRegExp](http://xregexp.com/).