From dca45f773bbc098f0cf0ad2430d5f8c4ba1c456b Mon Sep 17 00:00:00 2001
From: joaquinelio
Date: Wed, 5 Oct 2022 11:29:53 -0300
Subject: [PATCH 1/7] Unicode art, grammar suggestions

---
 1-js/99-js-misc/06-unicode/article.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/1-js/99-js-misc/06-unicode/article.md b/1-js/99-js-misc/06-unicode/article.md
index 2396fcfa..2268713e 100644
--- a/1-js/99-js-misc/06-unicode/article.md
+++ b/1-js/99-js-misc/06-unicode/article.md
@@ -2,7 +2,7 @@
 # Unicode, String internals

 ```warn header="Advanced knowledge"
-The section goes deeper into string internals. This knowledge will be useful for you if you plan to deal with emoji, rare mathematical or hieroglyphic characters or other rare symbols.
+The section goes deeper into string internals. This knowledge will be useful for you if you plan to deal with emoji, rare mathematical or hieroglyphic characters, or other rare symbols.
 ```

 As we already know, JavaScript strings are based on [Unicode](https://en.wikipedia.org/wiki/Unicode): each character is represented by a byte sequence of 1-4 bytes.
@@ -11,11 +11,11 @@ JavaScript allows us to insert a character into a string by specifying its hexad

 - `\xXX`

-    `XX` must be two hexadecimal digits with value between `00` and `FF`, then it's character whose Unicode code is `XX`.
+    `XX` must be two hexadecimal digits with a value between `00` and `FF`, then it's a character whose Unicode code is `XX`.

     Because the `\xXX` notation supports only two digits, it can be used only for the first 256 Unicode characters.

-    These first 256 characters include latin alphabet, most basic syntax characters and some others. For example, `"\x7A"` is the same as `"z"` (Unicode `U+007A`).
+    These first 256 characters include the latin alphabet, most basic syntax characters, and some others. For example, `"\x7A"` is the same as `"z"` (Unicode `U+007A`).

     ```js run
     alert( "\x7A" ); // z
@@ -23,9 +23,9 @@ JavaScript allows us to insert a character into a string by specifying its hexad
     ```

 - `\uXXXX`
-    `XXXX` must be exactly 4 hex digits with the value between `0000` and `FFFF`, then `\uXXXX` is a character whose Unicode code is `XXXX` .
+    `XXXX` must be exactly 4 hex digits with the value between `0000` and `FFFF`, then `\uXXXX` is a character whose Unicode code is `XXXX`.

-    Characters with Unicode value greater than `U+FFFF` can also be represented with this notation, but in this case we will need to use a so called surrogate pair (we will talk about surrogate pairs later in this chapter).
+    Characters with Unicode values greater than `U+FFFF` can also be represented with this notation, but in this case, we will need to use a so called surrogate pair (we will talk about surrogate pairs later in this chapter).

     ```js run
     alert( "\u00A9" ); // ©, the same as \xA9, using the 4-digit hex notation
@@ -120,7 +120,7 @@ For instance, the letter `a` can be the base character for these characters: `à

 Most common "composite" characters have their own code in the Unicode table. But not all of them, because there are too many possible combinations.

-To support arbitrary compositions, Unicode standard allows us to use several Unicode characters: the base character followed by one or many "mark" characters that "decorate" it.
+To support arbitrary compositions, the Unicode standard allows us to use several Unicode characters: the base character followed by one or many "mark" characters that "decorate" it.

 For instance, if we have `S` followed by the special "dot above" character (code `\u0307`), it is shown as Ṡ.

@@ -167,6 +167,6 @@ alert( "S\u0307\u0323".normalize().length ); // 1
 alert( "S\u0307\u0323".normalize() == "\u1e68" ); // true
 ```

-In reality, this is not always the case. The reason being that the symbol `Ṩ` is "common enough", so Unicode creators included it in the main table and gave it the code.
+In reality, this is not always the case. The reason is that the symbol `Ṩ` is "common enough", so Unicode creators included it in the main table and gave it the code.

 If you want to learn more about normalization rules and variants -- they are described in the appendix of the Unicode standard: [Unicode Normalization Forms](https://www.unicode.org/reports/tr15/), but for most practical purposes the information from this section is enough.

From dc7a157d8f0fd43d73253828d1f21668b3d24c00 Mon Sep 17 00:00:00 2001
From: joaquinelio
Date: Wed, 5 Oct 2022 11:57:18 -0300
Subject: [PATCH 2/7] Update article.md

---
 1-js/99-js-misc/06-unicode/article.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/1-js/99-js-misc/06-unicode/article.md b/1-js/99-js-misc/06-unicode/article.md
index 2268713e..e0c08d97 100644
--- a/1-js/99-js-misc/06-unicode/article.md
+++ b/1-js/99-js-misc/06-unicode/article.md
@@ -11,7 +11,7 @@ JavaScript allows us to insert a character into a string by specifying its hexad

 - `\xXX`

-    `XX` must be two hexadecimal digits with a value between `00` and `FF`, then it's a character whose Unicode code is `XX`.
+    `XX` must be two hexadecimal digits with a value between `00` and `FF`, then `\xXX` is the character whose Unicode code is `XX`.

     Because the `\xXX` notation supports only two digits, it can be used only for the first 256 Unicode characters.

@@ -23,7 +23,7 @@ JavaScript allows us to insert a character into a string by specifying its hexad
     ```

 - `\uXXXX`
-    `XXXX` must be exactly 4 hex digits with the value between `0000` and `FFFF`, then `\uXXXX` is a character whose Unicode code is `XXXX`.
+    `XXXX` must be exactly 4 hex digits with the value between `0000` and `FFFF`, then `\uXXXX` is the character whose Unicode code is `XXXX`.

     Characters with Unicode values greater than `U+FFFF` can also be represented with this notation, but in this case, we will need to use a so called surrogate pair (we will talk about surrogate pairs later in this chapter).


From 306a197d2435ba971b4a67bfbbc0b06046fde56c Mon Sep 17 00:00:00 2001
From: joaquinelio
Date: Mon, 10 Oct 2022 11:31:13 -0300
Subject: [PATCH 3/7] Update article.md

---
 1-js/99-js-misc/06-unicode/article.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/1-js/99-js-misc/06-unicode/article.md b/1-js/99-js-misc/06-unicode/article.md
index e0c08d97..6014bfe8 100644
--- a/1-js/99-js-misc/06-unicode/article.md
+++ b/1-js/99-js-misc/06-unicode/article.md
@@ -2,7 +2,7 @@
 # Unicode, String internals

 ```warn header="Advanced knowledge"
-The section goes deeper into string internals. This knowledge will be useful for you if you plan to deal with emoji, rare mathematical or hieroglyphic characters, or other rare symbols.
+The section goes deeper into string internals. This knowledge will be useful for you if you plan to deal with emoji, rare mathematical or logographic characters, or other rare symbols.
 ```

 As we already know, JavaScript strings are based on [Unicode](https://en.wikipedia.org/wiki/Unicode): each character is represented by a byte sequence of 1-4 bytes.
@@ -13,9 +13,9 @@ JavaScript allows us to insert a character into a string by specifying its hexad

     `XX` must be two hexadecimal digits with a value between `00` and `FF`, then `\xXX` is the character whose Unicode code is `XX`.

-    Because the `\xXX` notation supports only two digits, it can be used only for the first 256 Unicode characters.
+    Because the `\xXX` notation supports only two hexadecimal digits, it can be used only for the first 256 Unicode characters.

-    These first 256 characters include the latin alphabet, most basic syntax characters, and some others. For example, `"\x7A"` is the same as `"z"` (Unicode `U+007A`).
+    These first 256 characters include the Latin alphabet, most basic syntax characters, and some others. For example, `"\x7A"` is the same as `"z"` (Unicode `U+007A`).

     ```js run
     alert( "\x7A" ); // z
@@ -29,7 +29,7 @@ JavaScript allows us to insert a character into a string by specifying its hexad

     ```js run
     alert( "\u00A9" ); // ©, the same as \xA9, using the 4-digit hex notation
-    alert( "\u044F" ); // я, the cyrillic alphabet letter
+    alert( "\u044F" ); // я, the Cyrillic alphabet letter
     alert( "\u2191" ); // ↑, the arrow up symbol
     ```

@@ -38,13 +38,13 @@ JavaScript allows us to insert a character into a string by specifying its hexad
     `X…XXXXXX` must be a hexadecimal value of 1 to 6 bytes between `0` and `10FFFF` (the highest code point defined by Unicode). This notation allows us to easily represent all existing Unicode characters.

     ```js run
-    alert( "\u{20331}" ); // 佫, a rare Chinese hieroglyph (long Unicode)
+    alert( "\u{20331}" ); // 佫, a rare Chinese character (long Unicode)
     alert( "\u{1F60D}" ); // 😍, a smiling face symbol (another long Unicode)
     ```

 ## Surrogate pairs

-All frequently used characters have 2-byte codes. Letters in most european languages, numbers, and even most hieroglyphs, have a 2-byte representation.
+All frequently used characters have 2-byte codes (4 hex digits). Letters in most European languages, numbers, and the basic CJK ideograph set (from Chinese, Japanese, and Korean writing systems), have a 2-byte representation.

 Initially, JavaScript was based on UTF-16 encoding that only allowed 2 bytes per character. But 2 bytes only allow 65536 combinations and that's not enough for every possible symbol of Unicode.

@@ -55,7 +55,7 @@ As a side effect, the length of such symbols is `2`:
 ```js run
 alert( '𝒳'.length ); // 2, MATHEMATICAL SCRIPT CAPITAL X
 alert( '😂'.length ); // 2, FACE WITH TEARS OF JOY
-alert( '𩷶'.length ); // 2, a rare Chinese hieroglyph
+alert( '𩷶'.length ); // 2, a rare Chinese character
 ```

 That's because surrogate pairs did not exist at the time when JavaScript was created, and thus are not correctly processed by the language!

From 69bfbb04cb5ec60918681ba1b6e97005edc419af Mon Sep 17 00:00:00 2001
From: joaquinelio
Date: Mon, 10 Oct 2022 11:36:44 -0300
Subject: [PATCH 4/7] Update article.md

---
 1-js/99-js-misc/06-unicode/article.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/1-js/99-js-misc/06-unicode/article.md b/1-js/99-js-misc/06-unicode/article.md
index 6014bfe8..caafeda0 100644
--- a/1-js/99-js-misc/06-unicode/article.md
+++ b/1-js/99-js-misc/06-unicode/article.md
@@ -44,7 +44,7 @@ JavaScript allows us to insert a character into a string by specifying its hexad

 ## Surrogate pairs

-All frequently used characters have 2-byte codes (4 hex digits). Letters in most European languages, numbers, and the basic CJK ideograph set (from Chinese, Japanese, and Korean writing systems), have a 2-byte representation.
+All frequently used characters have 2-byte codes (4 hex digits). Letters in most European languages, numbers, and the basic unified CJK ideograph sets (CJK, from Chinese, Japanese, and Korean writing systems), have a 2-byte representation.

 Initially, JavaScript was based on UTF-16 encoding that only allowed 2 bytes per character. But 2 bytes only allow 65536 combinations and that's not enough for every possible symbol of Unicode.


From b89b938a06e75cca4af8c7aeb2fd1690800de2d0 Mon Sep 17 00:00:00 2001
From: joaquinelio
Date: Mon, 10 Oct 2022 11:51:18 -0300
Subject: [PATCH 5/7] Update article.md

---
 1-js/99-js-misc/06-unicode/article.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/1-js/99-js-misc/06-unicode/article.md b/1-js/99-js-misc/06-unicode/article.md
index caafeda0..04a445ca 100644
--- a/1-js/99-js-misc/06-unicode/article.md
+++ b/1-js/99-js-misc/06-unicode/article.md
@@ -44,7 +44,7 @@ JavaScript allows us to insert a character into a string by specifying its hexad

 ## Surrogate pairs

-All frequently used characters have 2-byte codes (4 hex digits). Letters in most European languages, numbers, and the basic unified CJK ideograph sets (CJK, from Chinese, Japanese, and Korean writing systems), have a 2-byte representation.
+All frequently used characters have 2-byte codes (4 hex digits). Letters in most European languages, numbers, and the basic unified CJK ideograph sets (CJK -- from Chinese, Japanese, and Korean writing systems), have a 2-byte representation.

 Initially, JavaScript was based on UTF-16 encoding that only allowed 2 bytes per character. But 2 bytes only allow 65536 combinations and that's not enough for every possible symbol of Unicode.


From 455c57aa5586a26eae5d435907381ee89fe67d1b Mon Sep 17 00:00:00 2001
From: joaquinelio
Date: Mon, 10 Oct 2022 11:54:42 -0300
Subject: [PATCH 6/7] Update article.md

---
 1-js/99-js-misc/06-unicode/article.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/1-js/99-js-misc/06-unicode/article.md b/1-js/99-js-misc/06-unicode/article.md
index 04a445ca..96b93e74 100644
--- a/1-js/99-js-misc/06-unicode/article.md
+++ b/1-js/99-js-misc/06-unicode/article.md
@@ -44,7 +44,7 @@ JavaScript allows us to insert a character into a string by specifying its hexad

 ## Surrogate pairs

-All frequently used characters have 2-byte codes (4 hex digits). Letters in most European languages, numbers, and the basic unified CJK ideograph sets (CJK -- from Chinese, Japanese, and Korean writing systems), have a 2-byte representation.
+All frequently used characters have 2-byte codes (4 hex digits). Letters in most European languages, numbers, and the basic unified CJK ideographic sets (CJK -- from Chinese, Japanese, and Korean writing systems), have a 2-byte representation.

 Initially, JavaScript was based on UTF-16 encoding that only allowed 2 bytes per character. But 2 bytes only allow 65536 combinations and that's not enough for every possible symbol of Unicode.


From 6f349121e30cd625fe71e9bb76ef8f3f2de158e0 Mon Sep 17 00:00:00 2001
From: joaquinelio
Date: Mon, 10 Oct 2022 12:01:15 -0300
Subject: [PATCH 7/7] Update article.md

---
 1-js/99-js-misc/06-unicode/article.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/1-js/99-js-misc/06-unicode/article.md b/1-js/99-js-misc/06-unicode/article.md
index 96b93e74..c2198989 100644
--- a/1-js/99-js-misc/06-unicode/article.md
+++ b/1-js/99-js-misc/06-unicode/article.md
@@ -2,7 +2,7 @@
 # Unicode, String internals

 ```warn header="Advanced knowledge"
-The section goes deeper into string internals. This knowledge will be useful for you if you plan to deal with emoji, rare mathematical or logographic characters, or other rare symbols.
+The section goes deeper into string internals. This knowledge will be useful for you if you plan to deal with emoji, rare mathematical or hieroglyphic characters, or other rare symbols.
 ```

 As we already know, JavaScript strings are based on [Unicode](https://en.wikipedia.org/wiki/Unicode): each character is represented by a byte sequence of 1-4 bytes.
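
Note for reviewers, outside the patch series itself: the behaviour described in the surrogate-pair and normalization hunks can be checked with a minimal sketch such as the one below. It is an illustration under stated assumptions, not text from the article. The names `face` and `composed` are hypothetical, `console.log` stands in for the article's `alert` so the snippet also runs outside a browser, and the pair `\uD83D\uDE0D` is computed here from `U+1F60D` rather than quoted from the hunks.

```javascript
// Assumption: any ES2015+ engine (String.prototype.normalize, \u{...} escapes).

// A code point above U+FFFF occupies two 16-bit code units (a surrogate pair).
const face = "\u{1F60D}"; // 😍, the same symbol used in the \u{X…XXXXXX} example
console.log(face.length);                      // 2 code units
console.log(face === "\uD83D\uDE0D");          // true, the same string written as a pair
console.log(face.codePointAt(0).toString(16)); // "1f60d"
console.log([...face].length);                 // 1, string iteration is code-point aware

// Base letter plus two combining marks collapses to one precomposed character.
const composed = "S\u0307\u0323".normalize();
console.log(composed.length);       // 1
console.log(composed === "\u1e68"); // true
```

Spreading the string (`[...face]`) is used here only because it counts code points rather than code units; it still does not merge combining marks, which is why the second half calls `normalize()`.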