Replace unicode with Unicode all over the book
parent e87f130fc1
commit 7c73f64a13
6 changed files with 23 additions and 23 deletions
@@ -1,6 +1,6 @@
 The maximal length must be `maxlength`, so we need to cut it a little shorter, to give space for the ellipsis.
 
-Note that there is actually a single unicode character for an ellipsis. That's not three dots.
+Note that there is actually a single Unicode character for an ellipsis. That's not three dots.
 
 ```js run demo
 function truncate(str, maxlength) {
@@ -50,7 +50,7 @@ let guestList = "Guests: // Error: Unexpected token ILLEGAL
 
 Single and double quotes come from ancient times of language creation when the need for multiline strings was not taken into account. Backticks appeared much later and thus are more versatile.
 
-Backticks also allow us to specify a "template function" before the first backtick. The syntax is: <code>func`string`</code>. The function `func` is called automatically, receives the string and embedded expressions and can process them. This is called "tagged templates". This feature makes it easier to implement custom templating, but is rarely used in practice. You can read more about it in the [manual](mdn:/JavaScript/Reference/Template_literals#Tagged_templates).
+Backticks also allow us to specify a "template function" before the first backtick. The syntax is: <code>func`string`</code>. The function `func` is called automatically, receives the string and embedded expressions and can process them. This is called "tagged templates". This feature makes it easier to implement custom templating, but is rarely used in practice. You can read more about it in the [manual](mdn:/JavaScript/Reference/Template_literals#Tagged_templates).
 
 ## Special characters
 
@@ -86,16 +86,16 @@ Here's the full list:
 |`\\`|Backslash|
 |`\t`|Tab|
 |`\b`, `\f`, `\v`| Backspace, Form Feed, Vertical Tab -- kept for compatibility, not used nowadays. |
-|`\xXX`|Unicode character with the given hexadecimal unicode `XX`, e.g. `'\x7A'` is the same as `'z'`.|
-|`\uXXXX`|A unicode symbol with the hex code `XXXX` in UTF-16 encoding, for instance `\u00A9` -- is a unicode for the copyright symbol `©`. It must be exactly 4 hex digits. |
-|`\u{X…XXXXXX}` (1 to 6 hex characters)|A unicode symbol with the given UTF-32 encoding. Some rare characters are encoded with two unicode symbols, taking 4 bytes. This way we can insert long codes. |
+|`\xXX`|Unicode character with the given hexadecimal Unicode `XX`, e.g. `'\x7A'` is the same as `'z'`.|
+|`\uXXXX`|A Unicode symbol with the hex code `XXXX` in UTF-16 encoding, for instance `\u00A9` -- is a Unicode for the copyright symbol `©`. It must be exactly 4 hex digits. |
+|`\u{X…XXXXXX}` (1 to 6 hex characters)|A Unicode symbol with the given UTF-32 encoding. Some rare characters are encoded with two Unicode symbols, taking 4 bytes. This way we can insert long codes. |
 
-Examples with unicode:
+Examples with Unicode:
 
 ```js run
 alert( "\u00A9" ); // ©
-alert( "\u{20331}" ); // 佫, a rare Chinese hieroglyph (long unicode)
-alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long unicode)
+alert( "\u{20331}" ); // 佫, a rare Chinese hieroglyph (long Unicode)
+alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long Unicode)
 ```
 
 All special characters start with a backslash character `\`. It is also called an "escape character".
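Not part of this commit: a minimal sketch of how the three escape forms in the table above behave, assuming a modern engine (Node.js or a current browser). `console.log` stands in for the tutorial's `alert` so the snippet also runs outside a browser, and the variable names are illustrative only.

```javascript
// Each escape form from the table produces an ordinary string.
const z = "\x7A";            // \xXX: two hex digits
const copyright = "\u00A9";  // \uXXXX: exactly four hex digits
const face = "\u{1F60D}";    // \u{...}: one to six hex digits

console.log(z === "z");                         // true
console.log(copyright.length);                  // 1
// Code points above 0xFFFF are stored as two UTF-16 code units (a surrogate pair).
console.log(face.length);                       // 2
console.log(face.codePointAt(0).toString(16));  // 1f60d
// String iteration works by code point, so the pair counts as one item.
console.log([...face].length);                  // 1
```

The `length` of 2 for the emoji is what the table means by "encoded with two Unicode symbols, taking 4 bytes".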
@@ -499,7 +499,7 @@ All strings are encoded using [UTF-16](https://en.wikipedia.org/wiki/UTF-16). Th
 alert( String.fromCodePoint(90) ); // Z
 ```
 
-We can also add unicode characters by their codes using `\u` followed by the hex code:
+We can also add Unicode characters by their codes using `\u` followed by the hex code:
 
 ```js run
 // 90 is 5a in hexadecimal system
@@ -608,7 +608,7 @@ In many languages there are symbols that are composed of the base character with
 
 For instance, the letter `a` can be the base character for: `àáâäãåā`. Most common "composite" character have their own code in the UTF-16 table. But not all of them, because there are too many possible combinations.
 
-To support arbitrary compositions, UTF-16 allows us to use several unicode characters: the base character followed by one or many "mark" characters that "decorate" it.
+To support arbitrary compositions, UTF-16 allows us to use several Unicode characters: the base character followed by one or many "mark" characters that "decorate" it.
 
 For instance, if we have `S` followed by the special "dot above" character (code `\u0307`), it is shown as Ṡ.
 
@@ -626,7 +626,7 @@ For example:
 alert( 'S\u0307\u0323' ); // Ṩ
 ```
 
-This provides great flexibility, but also an interesting problem: two characters may visually look the same, but be represented with different unicode compositions.
+This provides great flexibility, but also an interesting problem: two characters may visually look the same, but be represented with different Unicode compositions.
 
 For instance:
 
@@ -639,7 +639,7 @@ alert( `s1: ${s1}, s2: ${s2}` );
 alert( s1 == s2 ); // false though the characters look identical (?!)
 ```
 
-To solve this, there exists a "unicode normalization" algorithm that brings each string to the single "normal" form.
+To solve this, there exists a "Unicode normalization" algorithm that brings each string to the single "normal" form.
 
 It is implemented by [str.normalize()](mdn:js/String/normalize).
 
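Not part of this commit: a short sketch of the normalization step the hunk above describes. It assumes the two strings differ only in the order of their combining marks, and uses `console.log` instead of `alert` so it runs in Node.js as well.

```javascript
// The same visible character, built with the marks in a different order.
const s1 = 'S\u0307\u0323'; // S + dot above + dot below
const s2 = 'S\u0323\u0307'; // S + dot below + dot above

console.log(s1 === s2);                          // false: different code unit sequences
console.log(s1.normalize() === s2.normalize());  // true: same normal form (NFC by default)

// Here the normal form is a single precomposed character.
console.log(s1.normalize().length);              // 1
console.log(s1.normalize() === '\u1e68');        // true
```

Comparing normalized strings is usually the simplest option when user input may contain combining marks.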
@@ -663,7 +663,7 @@ If you want to learn more about normalization rules and variants -- they are des
 
 - There are 3 types of quotes. Backticks allow a string to span multiple lines and embed expressions `${…}`.
 - Strings in JavaScript are encoded using UTF-16.
-- We can use special characters like `\n` and insert letters by their unicode using `\u...`.
+- We can use special characters like `\n` and insert letters by their Unicode using `\u...`.
 - To get a character, use: `[]`.
 - To get a substring, use: `slice` or `substring`.
 - To lowercase/uppercase a string, use: `toLowerCase/toUpperCase`.
@@ -12,7 +12,7 @@ let decoder = new TextDecoder([label], [options]);
 - **`label`** -- the encoding, `utf-8` by default, but `big5`, `windows-1251` and many other are also supported.
 - **`options`** -- optional object:
 - **`fatal`** -- boolean, if `true` then throw an exception for invalid (non-decodable) characters, otherwise (default) replace them with character `\uFFFD`.
-- **`ignoreBOM`** -- boolean, if `true` then ignore BOM (an optional byte-order unicode mark), rarely needed.
+- **`ignoreBOM`** -- boolean, if `true` then ignore BOM (an optional byte-order Unicode mark), rarely needed.
 
 ...And then decode:
 
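Not part of this commit: a rough sketch of what the `fatal` and `ignoreBOM` options listed above do in practice, assuming an environment with a global `TextDecoder` (current browsers, or Node.js built with ICU). The byte arrays and variable names are made up for illustration.

```javascript
// Bytes for "Hi" followed by 0xFF, which is never valid in UTF-8.
const bad = new Uint8Array([72, 105, 0xFF]);

// Default mode: the invalid byte is replaced with U+FFFD.
const lenient = new TextDecoder().decode(bad);
console.log(lenient === "Hi\uFFFD"); // true

// fatal: true makes decode() throw a TypeError instead.
let threw = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(bad);
} catch (err) {
  threw = err instanceof TypeError;
}
console.log(threw); // true

// A UTF-8 byte-order mark (EF BB BF) followed by "A".
const withBom = new Uint8Array([0xEF, 0xBB, 0xBF, 65]);

// By default the BOM is consumed; with ignoreBOM: true it is kept as \uFEFF.
const stripped = new TextDecoder().decode(withBom);
const kept = new TextDecoder("utf-8", { ignoreBOM: true }).decode(withBom);
console.log(stripped.length, kept.length); // 1 2
```

Note that "ignore" here means "do not treat the BOM specially", so the mark stays in the decoded string.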
@@ -56,7 +56,7 @@ There are only 6 of them in JavaScript:
 : Enables "dotall" mode, that allows a dot `pattern:.` to match newline character `\n` (covered in the chapter <info:regexp-character-classes>).
 
 `pattern:u`
-: Enables full unicode support. The flag enables correct processing of surrogate pairs. More about that in the chapter <info:regexp-unicode>.
+: Enables full Unicode support. The flag enables correct processing of surrogate pairs. More about that in the chapter <info:regexp-unicode>.
 
 `pattern:y`
 : "Sticky" mode: searching at the exact position in the text (covered in the chapter <info:regexp-sticky>)
@@ -4,9 +4,9 @@ JavaScript uses [Unicode encoding](https://en.wikipedia.org/wiki/Unicode) for st
 
 That range is not big enough to encode all possible characters, that's why some rare characters are encoded with 4 bytes, for instance like `𝒳` (mathematical X) or `😄` (a smile), some hieroglyphs and so on.
 
-Here are the unicode values of some characters:
+Here are the Unicode values of some characters:
 
-| Character | Unicode | Bytes count in unicode |
+| Character | Unicode | Bytes count in Unicode |
 |------------|---------|--------|
 | a | `0x0061` | 2 |
 | ≈ | `0x2248` | 2 |
@@ -121,7 +121,7 @@ alert("number: xAF".match(regexp)); // xAF
 
 Let's look for Chinese hieroglyphs.
 
-There's a unicode property `Script` (a writing system), that may have a value: `Cyrillic`, `Greek`, `Arabic`, `Han` (Chinese) and so on, [here's the full list](https://en.wikipedia.org/wiki/Script_(Unicode)).
+There's a Unicode property `Script` (a writing system), that may have a value: `Cyrillic`, `Greek`, `Arabic`, `Han` (Chinese) and so on, [here's the full list](https://en.wikipedia.org/wiki/Script_(Unicode)).
 
 To look for characters in a given writing system we should use `pattern:Script=<value>`, e.g. for Cyrillic letters: `pattern:\p{sc=Cyrillic}`, for Chinese hieroglyphs: `pattern:\p{sc=Han}`, and so on:
 
@@ -135,7 +135,7 @@ alert( str.match(regexp) ); // 你,好
 
 ### Example: currency
 
-Characters that denote a currency, such as `$`, `€`, `¥`, have unicode property `pattern:\p{Currency_Symbol}`, the short alias: `pattern:\p{Sc}`.
+Characters that denote a currency, such as `$`, `€`, `¥`, have Unicode property `pattern:\p{Currency_Symbol}`, the short alias: `pattern:\p{Sc}`.
 
 Let's use it to look for prices in the format "currency, followed by a digit":
 
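Not part of this commit: a small sketch of the two property escapes mentioned in the hunks above, `\p{sc=Han}` and `\p{Sc}`. It assumes an engine that supports Unicode property escapes with the `u` flag; the sample string is invented for the example.

```javascript
// A made-up sample containing currency symbols and Chinese characters.
const text = "Prices: $2, €1, ¥9 and 你好";

// \p{Sc} (Currency_Symbol) followed by a digit; the u flag is required for \p{...}.
const prices = text.match(/\p{Sc}\d/gu);
console.log(prices); // [ '$2', '€1', '¥9' ]

// \p{sc=Han} matches characters whose Script property is Han.
const han = text.match(/\p{sc=Han}/gu);
console.log(han); // [ '你', '好' ]
```

Note the difference in case: `Sc` is the currency category, while `sc=` is the short name of the `Script` property.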
@@ -57,16 +57,16 @@ For instance:
 
 - **\d** -- is the same as `pattern:[0-9]`,
 - **\w** -- is the same as `pattern:[a-zA-Z0-9_]`,
-- **\s** -- is the same as `pattern:[\t\n\v\f\r ]`, plus few other rare unicode space characters.
+- **\s** -- is the same as `pattern:[\t\n\v\f\r ]`, plus few other rare Unicode space characters.
 ```
 
 ### Example: multi-language \w
 
 As the character class `pattern:\w` is a shorthand for `pattern:[a-zA-Z0-9_]`, it can't find Chinese hieroglyphs, Cyrillic letters, etc.
 
-We can write a more universal pattern, that looks for wordly characters in any language. That's easy with unicode properties: `pattern:[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]`.
+We can write a more universal pattern, that looks for wordly characters in any language. That's easy with Unicode properties: `pattern:[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]`.
 
-Let's decipher it. Similar to `pattern:\w`, we're making a set of our own that includes characters with following unicode properties:
+Let's decipher it. Similar to `pattern:\w`, we're making a set of our own that includes characters with following Unicode properties:
 
 - `Alphabetic` (`Alpha`) - for letters,
 - `Mark` (`M`) - for accents,
@@ -85,7 +85,7 @@ let str = `Hi 你好 12`;
 alert( str.match(regexp) ); // H,i,你,好,1,2
 ```
 
-Of course, we can edit this pattern: add unicode properties or remove them. Unicode properties are covered in more details in the article <info:regexp-unicode>.
+Of course, we can edit this pattern: add Unicode properties or remove them. Unicode properties are covered in more details in the article <info:regexp-unicode>.
 
 ```warn header="Unicode properties aren't supported in Edge and Firefox"
 Unicode properties `pattern:p{…}` are not yet implemented in Edge and Firefox. If we really need them, we can use library [XRegExp](http://xregexp.com/).