WIP

2019-09-05 14:57:06 +03:00 · 20547570ff
commit 20547570ff
parent fc0b18538d
12 changed files with 376 additions and 186 deletions
--- a/9-regular-expressions/08-regexp-character-sets-and-ranges/article.md
+++ b/9-regular-expressions/08-regexp-character-sets-and-ranges/article.md
@@ -22,7 +22,7 @@ So the example below gives no matches:
 alert( "Voila".match(/V[oi]la/) ); // null, no matches
 ```

-The pattern assumes:
+The pattern searches for:

 - `pattern:V`,
 - then *one* of the letters `pattern:[oi]`,
@@ -42,23 +42,56 @@ In the example below we're searching for `"x"` followed by two digits or letters
 alert( "Exception 0xAF".match(/x[0-9A-F][0-9A-F]/g) ); // xAF
 ```

-Please note that in the word `subject:Exception` there's a substring `subject:xce`. It didn't match the pattern, because the letters are lowercase, while in the set `pattern:[0-9A-F]` they are uppercase.
+Here `pattern:[0-9A-F]` has two ranges: it searches for a character that is either a digit from `0` to `9` or a letter from `A` to `F`.

-If we want to find it too, then we can add a range `a-f`: `pattern:[0-9A-Fa-f]`. The `pattern:i` flag would allow lowercase too.
+If we'd like to look for lowercase letters as well, we can add the range `a-f`: `pattern:[0-9A-Fa-f]`. Alternatively, we can add the flag `pattern:i`.

-**Character classes are shorthands for certain character sets.**
+We can also use character classes inside `[…]`.

+For instance, if we'd like to look for a wordly character `pattern:\w` or a hyphen `pattern:-`, then the set is `pattern:[\w-]`.
+
+Combining multiple classes is also possible, e.g. `pattern:[\s\d]` means "a space character or a digit".
+
+```smart header="Character classes are shorthands for certain character sets"
 For instance:

 - **\d** -- is the same as `pattern:[0-9]`,
 - **\w** -- is the same as `pattern:[a-zA-Z0-9_]`,
- **\s** -- is the same as `pattern:[\t\n\v\f\r ]` plus few other unicode space characters.
+- **\s** -- is the same as `pattern:[\t\n\v\f\r ]`, plus a few other rare Unicode space characters.
+```

-We can use character classes inside `[…]` as well.
+### Example: multi-language \w

-For instance, we want to match all wordly characters or a dash, for words like "twenty-third". We can't do it with `pattern:\w+`, because `pattern:\w` class does not include a dash. But we can use `pattern:[\w-]`.
+As the character class `pattern:\w` is a shorthand for `pattern:[a-zA-Z0-9_]`, it can't find Chinese characters, Cyrillic letters, etc.

-We also can use several classes, for example `pattern:[\s\S]` matches spaces or non-spaces -- any character. That's wider than a dot `"."`, because the dot matches any character except a newline (unless `pattern:s` flag is set).
+We can write a more universal pattern that looks for wordly characters in any language. That's easy with Unicode properties: `pattern:[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]`.
+
+Let's decipher it. Similar to `pattern:\w`, we're making a set of our own that includes characters with the following Unicode properties:
+
+- `Alphabetic` (`Alpha`) - for letters,
+- `Mark` (`M`) - for accents,
+- `Decimal_Number` (`Nd`) - for digits,
+- `Connector_Punctuation` (`Pc`) - for the underscore `'_'` and similar characters,
+- `Join_Control` (`Join_C`) - two special codes `200c` and `200d`, used in ligatures, e.g. in Arabic.
+
+An example of use:
+
+```js run
+let regexp = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]/gu;
+
+let str = `Hi 你好 12`;
+
+// finds all letters and digits:
+alert( str.match(regexp) ); // H,i,你,好,1,2
+```
+
+Of course, we can edit this pattern: add Unicode properties or remove them. Unicode properties are covered in more detail in the article <info:regexp-unicode>.
+
+```warn header="Unicode properties aren't supported in Edge and Firefox"
+Unicode properties `pattern:\p{…}` are not yet implemented in Edge and Firefox. If we really need them, we can use the library [XRegExp](http://xregexp.com/).
+
+Or just use ranges of characters in a language that interests us, e.g. `pattern:[а-я]` for Cyrillic letters.
+```

 ## Excluding ranges

@@ -78,22 +111,20 @@ The example below looks for any characters except letters, digits and spaces:
 alert( "alice15@gmail.com".match(/[^\d\sA-Z]/gi) ); // @ and .
 ```

-## No escaping in […]
+## Escaping in […]

-Usually when we want to find exactly the dot character, we need to escape it like `pattern:\.`. And if we need a backslash, then we use `pattern:\\`.
+Usually when we want to find exactly a special character, we need to escape it like `pattern:\.`. And if we need a backslash, then we use `pattern:\\`, and so on.

-In square brackets the vast majority of special characters can be used without escaping:
+In square brackets we can use the vast majority of special characters without escaping:

- A dot `pattern:'.'`.
- A plus `pattern:'+'`.
- Parentheses `pattern:'( )'`.
- Dash `pattern:'-'` in the beginning or the end (where it does not define a range).
- A caret `pattern:'^'` if not in the beginning (where it means exclusion).
- And the opening square bracket `pattern:'['`.
+- Symbols `pattern:. + ( )` never need escaping.
+- A hyphen `pattern:-` is not escaped at the beginning or the end (where it does not define a range).
+- A caret `pattern:^` is only escaped at the beginning (where it means exclusion).
+- The closing square bracket `pattern:]` is always escaped (if we need to look for that symbol).

-In other words, all special characters are allowed except where they mean something for square brackets.
+In other words, all special characters are allowed without escaping, except when they mean something for square brackets.

-A dot `"."` inside square brackets means just a dot. The pattern `pattern:[.,]` would look for one of characters: either a dot or a comma.
+A dot `.` inside square brackets means just a dot. The pattern `pattern:[.,]` would look for one of the characters: either a dot or a comma.

 In the example below the regexp `pattern:[-().^+]` looks for one of the characters `-().^+`:

@@ -112,3 +143,55 @@ let reg = /[\-\(\)\.\^\+]/g;

 alert( "1 + 2 - 3".match(reg) ); // also works: +, -
 ```
+
+## Ranges and flag "u"
+
+If there are surrogate pairs in the set, the flag `pattern:u` is required for them to work correctly.
+
+For instance, let's look for `pattern:[𝒳𝒴]` in the string `subject:𝒳`:
+
+```js run
+alert( '𝒳'.match(/[𝒳𝒴]/) ); // shows a strange character, like [?]
+// (the search was performed incorrectly, half-character returned)
+```
+
+The result is incorrect, because by default regular expressions "don't know" about surrogate pairs.
+
+The regular expression engine treats `[𝒳𝒴]` as four characters, not two:
+1. left half of `𝒳` `(1)`,
+2. right half of `𝒳` `(2)`,
+3. left half of `𝒴` `(3)`,
+4. right half of `𝒴` `(4)`.
+
+We can see their codes like this:
+
+```js run
+for (let i = 0; i < '𝒳𝒴'.length; i++) {
+  alert('𝒳𝒴'.charCodeAt(i)); // 55349, 56499, 55349, 56500
+}
+```
+
+So, the example above finds and shows the left half of `𝒳`.
+
+If we add the flag `pattern:u`, then the behavior will be correct:
+
+```js run
+alert( '𝒳'.match(/[𝒳𝒴]/u) ); // 𝒳
+```
+
+A similar situation occurs when looking for a range, such as `[𝒳-𝒴]`.
+
+If we forget to add the flag `pattern:u`, there will be an error:
+
+```js run
+'𝒳'.match(/[𝒳-𝒴]/); // Error: Invalid regular expression
+```
+
+The reason is that without flag `pattern:u` surrogate pairs are perceived as two characters, so `[𝒳-𝒴]` is interpreted as `[<55349><56499>-<55349><56500>]` (every surrogate pair is replaced with its codes). Now it's easy to see that the range `56499-55349` is invalid: its starting code `56499` is greater than the end `55349`. That's the formal reason for the error.
+
+With the flag `pattern:u` the pattern works correctly:
+
+```js run
+// look for characters from 𝒳 to 𝒵
+alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴
+```
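
The behavior described in the "Ranges and flag "u"" section can be checked outside the browser with a small sketch like the one below. It is not part of the commit: `console.log` stands in for the article's `alert`, and the variable names `withoutU`, `withU` and `rangeError` are made up for illustration.

```javascript
// Illustrative sketch only: console.log replaces the article's alert().
const str = '𝒳';

// Without "u" the set holds four 16-bit code units,
// so only the first half of 𝒳 is matched.
const withoutU = str.match(/[𝒳𝒴]/);
console.log(withoutU[0].length);        // 1 (a lone surrogate)
console.log(withoutU[0].charCodeAt(0)); // 55349

// With "u" the set holds two code points,
// so the whole character is matched.
const withU = str.match(/[𝒳𝒴]/u);
console.log(withU[0] === '𝒳');          // true
console.log(withU[0].length);           // 2 (one surrogate pair)

// A range over surrogate pairs fails to compile without "u",
// because it is read as the out-of-order range 56499-55349.
let rangeError = null;
try {
  new RegExp('[𝒳-𝒴]');
} catch (e) {
  rangeError = e;
}
console.log(rangeError instanceof SyntaxError); // true
```

Building the invalid range with `new RegExp` instead of a literal keeps the error catchable at run time; a regexp literal with the same pattern would stop the whole script from parsing.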