diff --git a/9-regular-expressions/01-regexp-introduction/article.md b/9-regular-expressions/01-regexp-introduction/article.md index 1632a930..5cbe011a 100644 --- a/9-regular-expressions/01-regexp-introduction/article.md +++ b/9-regular-expressions/01-regexp-introduction/article.md @@ -2,7 +2,7 @@ Regular expressions is a powerful way to search and replace in text. -In JavaScript, they are available as `RegExp` object, and also integrated in methods of strings. +In JavaScript, they are available as [RegExp](mdn:js/RegExp) object, and also integrated in methods of strings. ## Regular Expressions @@ -23,35 +23,43 @@ regexp = /pattern/; // no flags regexp = /pattern/gmi; // with flags g,m and i (to be covered soon) ``` -Slashes `"/"` tell JavaScript that we are creating a regular expression. They play the same role as quotes for strings. +Slashes `pattern:/.../` tell JavaScript that we are creating a regular expression. They play the same role as quotes for strings. -## Usage +In both cases `regexp` becomes an object of the built-in `RegExp` class. -To search inside a string, we can use method [search](mdn:js/String/search). +The main difference between these two syntaxes is that slashes `pattern:/.../` do not allow to insert expressions (like strings with `${...}`). They are fully static. -Here's an example: +Slashes are used when we know the regular expression at the code writing time -- and that's the most common situation. While `new RegExp` is used when we need to create a regexp "on the fly", from a dynamically generated string, for instance: -```js run -let str = "I love JavaScript!"; // will search here +```js +let tag = prompt("What tag do you want to find?", "h2"); -let regexp = /love/; -alert( str.search(regexp) ); // 2 +let regexp = new RegExp(`<${tag}>`); // same as /

/ if answered "h2" in the prompt above ``` -The `str.search` method looks for the pattern `pattern:/love/` and returns the position inside the string. As we might guess, `pattern:/love/` is the simplest possible pattern. What it does is a simple substring search. +## Flags -The code above is the same as: +Regular expressions may have flags that affect the search. -```js run -let str = "I love JavaScript!"; // will search here +There are only 6 of them in JavaScript: -let substr = 'love'; -alert( str.search(substr) ); // 2 -``` +`pattern:i` +: With this flag the search is case-insensitive: no difference between `A` and `a` (see the example below). -So searching for `pattern:/love/` is the same as searching for `"love"`. +`pattern:g` +: With this flag the search looks for all matches, without it -- only the first one. -But that's only for now. Soon we'll create more complex regular expressions with much more searching power. +`pattern:m` +: Multiline mode (covered in the chapter ). + +`pattern:s` +: Enables "dotall" mode, that allows a dot `pattern:.` to match newline character `\n` (covered in the chapter ). + +`pattern:u` +: Enables full unicode support. The flag enables correct processing of surrogate pairs. More about that in the chapter . + +`pattern:y` +: "Sticky" mode: searching at the exact position in the text (covered in the chapter ) ```smart header="Colors" From here on the color scheme is: @@ -61,65 +69,109 @@ From here on the color scheme is: - result -- `match:green` ``` +## Searching: str.match -````smart header="When to use `new RegExp`?" -Normally we use the short syntax `/.../`. But it does not support variable insertions `${...}`. +As it was said previously, regular expressions are integrated with string methods. -On the other hand, `new RegExp` allows to construct a pattern dynamically from a string, so it's more flexible. +The method `str.match(regexp)` finds all matches of `regexp` in the string `str`. 
-Here's an example of a dynamically generated regexp: +It has 3 working modes: + +1. If the regular expression has flag `pattern:g`, it returns an array of all matches: + ```js run + let str = "We will, we will rock you"; + + alert( str.match(/we/gi) ); // We,we (an array of 2 matches) + ``` + Please note that both `match:We` and `match:we` are found, because flag `pattern:i` makes the regular expression case-insensitive. + +2. If there's no such flag it returns only the first match in the form of an array, with the full match at index `0` and some additional details in properties: + ```js run + let str = "We will, we will rock you"; + + let result = str.match(/we/i); // without flag g + + alert( result[0] ); // We (1st match) + alert( result.length ); // 1 + + // Details: + alert( result.index ); // 0 (position of the match) + alert( result.input ); // We will, we will rock you (source string) + ``` + The array may have other indexes, besides `0` if a part of the regular expression is enclosed in parentheses. We'll cover that in the chapter . + +3. And, finally, if there are no matches, `null` is returned (doesn't matter if there's flag `pattern:g` or not). + + That's a very important nuance. If there are no matches, we get not an empty array, but `null`. Forgetting about that may lead to errors, e.g.: + + ```js run + let matches = "JavaScript".match(/HTML/); // = null + + if (!matches.length) { // Error: Cannot read property 'length' of null + alert("Error in the line above"); + } + ``` + + If we'd like the result to be always an array, we can write it this way: + + ```js run + let matches = "JavaScript".match(/HTML/)*!* || []*/!*; + + if (!matches.length) { + alert("No matches"); // now it works + } + ``` + +## Replacing: str.replace + +The method `str.replace(regexp, replacement)` replaces matches with `regexp` in string `str` with `replacement` (all matches, if there's flag `pattern:g`, otherwise only the first one). 
+ +For instance: ```js run -let tag = prompt("Which tag you want to search?", "h2"); -let regexp = new RegExp(`<${tag}>`); +// no flag g +alert( "We will, we will".replace(/we/i, "I") ); // I will, we will -// finds

by default -alert( "

".search(regexp)); +// with flag g +alert( "We will, we will".replace(/we/ig, "I") ); // I will, I will ``` -```` +The second argument is the `replacement` string. We can use special character combinations in it to insert fragments of the match: -## Flags +| Symbols | Action in the replacement string | +|--------|--------| +|`$&`|inserts the whole match| +|$`|inserts a part of the string before the match| +|`$'`|inserts a part of the string after the match| +|`$n`|if `n` is a 1-2 digit number, then it inserts the contents of n-th parentheses, more about it in the chapter | +|`$`|inserts the contents of the parentheses with the given `name`, more about it in the chapter | +|`$$`|inserts character `$` | -Regular expressions may have flags that affect the search. - -There are only 6 of them in JavaScript: - -`i` -: With this flag the search is case-insensitive: no difference between `A` and `a` (see the example below). - -`g` -: With this flag the search looks for all matches, without it -- only the first one (we'll see uses in the next chapter). - -`m` -: Multiline mode (covered in the chapter ). - -`s` -: "Dotall" mode, allows `.` to match newlines (covered in the chapter ). - -`u` -: Enables full unicode support. The flag enables correct processing of surrogate pairs. More about that in the chapter . - -`y` -: Sticky mode (covered in the chapter ) - -We'll cover all these flags further in the tutorial. - -For now, the simplest flag is `i`, here's an example: +An example with `pattern:$&`: ```js run -let str = "I love JavaScript!"; - -alert( str.search(/LOVE/i) ); // 2 (found lowercased) - -alert( str.search(/LOVE/) ); // -1 (nothing found without 'i' flag) +alert( "I love HTML".replace(/HTML/, "$& and JavaScript") ); // I love HTML and JavaScript ``` -So the `i` flag already makes regular expressions more powerful than a simple substring search. But there's so much more. We'll cover other flags and features in the next chapters. 
+## Testing: regexp.test +The method `regexp.test(str)` looks for at least one match and returns `true` if one is found, otherwise `false`. + +```js run +let str = "I love JavaScript"; +let reg = /LOVE/i; + +alert( reg.test(str) ); // true +``` + +Further in this chapter we'll study more regular expressions, come across many other examples and also meet other methods. + +Full information about the methods is given in the article . ## Summary -- A regular expression consists of a pattern and optional flags: `g`, `i`, `m`, `u`, `s`, `y`. -- Without flags and special symbols that we'll study later, the search by a regexp is the same as a substring search. -- The method `str.search(regexp)` returns the index where the match is found or `-1` if there's no match. In the next chapter we'll see other methods. +- A regular expression consists of a pattern and optional flags: `pattern:g`, `pattern:i`, `pattern:m`, `pattern:u`, `pattern:s`, `pattern:y`. +- Without flags and special symbols that we'll study later, the search by a regexp is the same as a substring search. +- The method `str.match(regexp)` looks for matches: all of them if there's `pattern:g` flag, otherwise only the first one. +- The method `str.replace(regexp, replacement)` replaces matches of `regexp` with `replacement`: all of them if there's `pattern:g` flag, otherwise only the first one. +- The method `regexp.test(str)` returns `true` if there's at least one match, otherwise `false`. diff --git a/9-regular-expressions/02-regexp-character-classes/article.md b/9-regular-expressions/02-regexp-character-classes/article.md new file mode 100644 index 00000000..881b6ba2 --- /dev/null +++ b/9-regular-expressions/02-regexp-character-classes/article.md @@ -0,0 +1,189 @@ +# Character classes + +Consider a practical task -- we have a phone number like `"+7(903)-123-45-67"`, and we need to turn it into pure numbers: `79031234567`. + +To do so, we can find and remove anything that's not a number. 
Character classes can help with that. + +A *character class* is a special notation that matches any symbol from a certain set. + +To start, let's explore the "digit" class. It's written as `pattern:\d` and corresponds to "any single digit". + +For instance, let's find the first digit in the phone number: + +```js run +let str = "+7(903)-123-45-67"; + +let reg = /\d/; + +alert( str.match(reg) ); // 7 +``` + +Without the flag `pattern:g`, the regular expression only looks for the first match, that is the first digit `pattern:\d`. + +Let's add the `pattern:g` flag to find all digits: + +```js run +let str = "+7(903)-123-45-67"; + +let reg = /\d/g; + +alert( str.match(reg) ); // array of matches: 7,9,0,3,1,2,3,4,5,6,7 + +// let's make the digits-only phone number of them: +alert( str.match(reg).join('') ); // 79031234567 +``` + +That was a character class for digits. There are other character classes as well. + +The most used are: + +`pattern:\d` ("d" is from "digit") +: A digit: a character from `0` to `9`. + +`pattern:\s` ("s" is from "space") +: A space symbol: includes spaces, tabs `\t`, newlines `\n` and a few other rare characters: `\v`, `\f` and `\r`. + +`pattern:\w` ("w" is from "word") +: A "wordly" character: either a letter of the Latin alphabet or a digit or an underscore `_`. Non-Latin letters (like Cyrillic or Hindi) do not belong to `pattern:\w`. + +For instance, `pattern:\d\s\w` means a "digit" followed by a "space character" followed by a "wordly character", such as `match:1 a`. 
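As a quick check of the `pattern:\d\s\w` pattern, here is a minimal sketch. The sample strings and variable names are made up for illustration, and `console.log` is used instead of `alert` so the snippet also runs outside a browser:

```js
// \d\s\w: a digit, then a space character, then a "wordly" character
let reg = /\d\s\w/;

let first = "Chapter 1 a: intro".match(reg)[0]; // "1 a"
let withTab = "7\t_".match(reg)[0];             // a tab and an underscore also fit
let noSpace = "1a".match(reg);                  // null: nothing between the digit and the letter

console.log(first, JSON.stringify(withTab), noSpace);
```

The last case shows that every class in the pattern has to consume exactly one character of the string.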
+ +**A regexp may contain both regular symbols and character classes.** + +For instance, `pattern:CSS\d` matches a string `match:CSS` with a digit after it: + +```js run +let str = "Is there CSS4?"; +let reg = /CSS\d/; + +alert( str.match(reg) ); // CSS4 +``` + +Also we can use many character classes: + +```js run +alert( "I love HTML5!".match(/\s\w\w\w\w\d/) ); // ' HTML5' +``` + +The match (each regexp character class has the corresponding result character): + +![](love-html5-classes.svg) + +## Inverse classes + +For every character class there exists an "inverse class", denoted with the same letter, but uppercased. + +The "inverse" means that it matches all other characters, for instance: + +`pattern:\D` +: Non-digit: any character except `pattern:\d`, for instance a letter. + +`pattern:\S` +: Non-space: any character except `pattern:\s`, for instance a letter. + +`pattern:\W` +: Non-wordly character: anything but `pattern:\w`, e.g. a non-Latin letter or a space. + +In the beginning of the chapter we saw how to make a number-only phone number from a string like `subject:+7(903)-123-45-67`: find all digits and join them. + +```js run +let str = "+7(903)-123-45-67"; + +alert( str.match(/\d/g).join('') ); // 79031234567 +``` + +An alternative, shorter way is to find non-digits `pattern:\D` and remove them from the string: + +```js run +let str = "+7(903)-123-45-67"; + +alert( str.replace(/\D/g, "") ); // 79031234567 +``` + +## A dot is any character + +A dot `pattern:.` is a special character class that matches "any character except a newline". + +For instance: + +```js run +alert( "Z".match(/./) ); // Z +``` + +Or in the middle of a regexp: + +```js run +let reg = /CS.4/; + +alert( "CSS4".match(reg) ); // CSS4 +alert( "CS-4".match(reg) ); // CS-4 +alert( "CS 4".match(reg) ); // CS 4 (space is also a character) +``` + +Please note that a dot means "any character", but not the "absence of a character". 
There must be a character to match it: + +```js run +alert( "CS4".match(/CS.4/) ); // null, no match because there's no character for the dot +``` + +### Dot as literally any character with "s" flag + +Usually a dot doesn't match a newline character `\n`. + +For instance, the regexp `pattern:A.B` matches `match:A`, and then `match:B` with any character between them, except a newline `\n`: + +```js run +alert( "A\nB".match(/A.B/) ); // null (no match) +``` + +There are many situations when we'd like a dot to mean literally "any character", newline included. + +That's what flag `pattern:s` does. If a regexp has it, then a dot `pattern:.` matches literally any character: + +```js run +alert( "A\nB".match(/A.B/s) ); // A\nB (match!) +``` + +````warn header="Pay attention to spaces" +Usually we pay little attention to spaces. For us strings `subject:1-5` and `subject:1 - 5` are nearly identical. + +But if a regexp doesn't take spaces into account, it may fail to work. + +Let's try to find digits separated by a hyphen: + +```js run +alert( "1 - 5".match(/\d-\d/) ); // null, no match! +``` + +Let's fix it adding spaces into the regexp `pattern:\d - \d`: + +```js run +alert( "1 - 5".match(/\d - \d/) ); // 1 - 5, now it works +// or we can use \s class: +alert( "1 - 5".match(/\d\s-\s\d/) ); // 1 - 5, also works +``` + +**A space is a character. Equal in importance with any other character.** + +We can't add or remove spaces from a regular expression and expect to work the same. + +In other words, in a regular expression all characters matter, spaces too. +```` + +## Summary + +There exist following character classes: + +- `pattern:\d` -- digits. +- `pattern:\D` -- non-digits. +- `pattern:\s` -- space symbols, tabs, newlines. +- `pattern:\S` -- all but `pattern:\s`. +- `pattern:\w` -- Latin letters, digits, underscore `'_'`. +- `pattern:\W` -- all but `pattern:\w`. +- `pattern:.` -- any character if with the regexp `'s'` flag, otherwise any except a newline `\n`. 
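The classes listed above can be tried side by side. This is only an illustrative sketch: the sample string and variable names are assumptions, and `console.log` replaces `alert` so it runs outside a browser too:

```js
// Sample text with a space, a newline, digits and an underscore
let sample = "Room 12\nB_7";

let digits = sample.match(/\d/g);           // 1,2,7
let spaces = sample.match(/\s/g);           // a space and a newline
let wordly = sample.match(/\w/g).join('');  // Room12B_7
let nonDigits = sample.match(/\D/g).length; // 8 of the 11 characters are not digits

// The dot skips the newline unless the "s" flag is set
let dotPlain = /2.B/.test(sample);  // false
let dotAll = /2.B/s.test(sample);   // true

console.log(digits, spaces.length, wordly, nonDigits, dotPlain, dotAll);
```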
+ +...But that's not all! + +Unicode encoding, used by JavaScript for strings, provides many properties for characters, like: which language the letter belongs to (if it's a letter) it is it a punctuation sign, etc. + +We can search by these properties as well. That requires flag `pattern:u`, covered in the next article. diff --git a/9-regular-expressions/03-regexp-character-classes/love-html5-classes.svg b/9-regular-expressions/02-regexp-character-classes/love-html5-classes.svg similarity index 100% rename from 9-regular-expressions/03-regexp-character-classes/love-html5-classes.svg rename to 9-regular-expressions/02-regexp-character-classes/love-html5-classes.svg diff --git a/9-regular-expressions/03-regexp-character-classes/1-find-time-hh-mm/solution.md b/9-regular-expressions/03-regexp-character-classes/1-find-time-hh-mm/solution.md deleted file mode 100644 index 829eda13..00000000 --- a/9-regular-expressions/03-regexp-character-classes/1-find-time-hh-mm/solution.md +++ /dev/null @@ -1,6 +0,0 @@ - -The answer: `pattern:\b\d\d:\d\d\b`. - -```js run -alert( "Breakfast at 09:00 in the room 123:456.".match( /\b\d\d:\d\d\b/ ) ); // 09:00 -``` diff --git a/9-regular-expressions/03-regexp-character-classes/1-find-time-hh-mm/task.md b/9-regular-expressions/03-regexp-character-classes/1-find-time-hh-mm/task.md deleted file mode 100644 index 5e32b9c4..00000000 --- a/9-regular-expressions/03-regexp-character-classes/1-find-time-hh-mm/task.md +++ /dev/null @@ -1,8 +0,0 @@ -# Find the time - -The time has a format: `hours:minutes`. Both hours and minutes has two digits, like `09:00`. - -Make a regexp to find time in the string: `subject:Breakfast at 09:00 in the room 123:456.` - -P.S. In this task there's no need to check time correctness yet, so `25:99` can also be a valid result. -P.P.S. The regexp shouldn't match `123:456`. 
diff --git a/9-regular-expressions/03-regexp-character-classes/article.md b/9-regular-expressions/03-regexp-character-classes/article.md deleted file mode 100644 index 8e18df91..00000000 --- a/9-regular-expressions/03-regexp-character-classes/article.md +++ /dev/null @@ -1,270 +0,0 @@ -# Character classes - -Consider a practical task -- we have a phone number `"+7(903)-123-45-67"`, and we need to turn it into pure numbers: `79035419441`. - -To do so, we can find and remove anything that's not a number. Character classes can help with that. - -A character class is a special notation that matches any symbol from a certain set. - -For the start, let's explore a "digit" class. It's written as `\d`. We put it in the pattern, that means "any single digit". - -For instance, the let's find the first digit in the phone number: - -```js run -let str = "+7(903)-123-45-67"; - -let reg = /\d/; - -alert( str.match(reg) ); // 7 -``` - -Without the flag `g`, the regular expression only looks for the first match, that is the first digit `\d`. - -Let's add the `g` flag to find all digits: - -```js run -let str = "+7(903)-123-45-67"; - -let reg = /\d/g; - -alert( str.match(reg) ); // array of matches: 7,9,0,3,1,2,3,4,5,6,7 - -alert( str.match(reg).join('') ); // 79035419441 -``` - -That was a character class for digits. There are other character classes as well. - -Most used are: - -`\d` ("d" is from "digit") -: A digit: a character from `0` to `9`. - -`\s` ("s" is from "space") -: A space symbol: that includes spaces, tabs, newlines. - -`\w` ("w" is from "word") -: A "wordly" character: either a letter of English alphabet or a digit or an underscore. Non-Latin letters (like cyrillic or hindi) do not belong to `\w`. - -For instance, `pattern:\d\s\w` means a "digit" followed by a "space character" followed by a "wordly character", like `"1 a"`. 
- -**A regexp may contain both regular symbols and character classes.** - -For instance, `pattern:CSS\d` matches a string `match:CSS` with a digit after it: - -```js run -let str = "CSS4 is cool"; -let reg = /CSS\d/ - -alert( str.match(reg) ); // CSS4 -``` - -Also we can use many character classes: - -```js run -alert( "I love HTML5!".match(/\s\w\w\w\w\d/) ); // ' HTML5' -``` - -The match (each character class corresponds to one result character): - -![](love-html5-classes.svg) - -## Word boundary: \b - -A word boundary `pattern:\b` -- is a special character class. - -It does not denote a character, but rather a boundary between characters. - -For instance, `pattern:\bJava\b` matches `match:Java` in the string `subject:Hello, Java!`, but not in the script `subject:Hello, JavaScript!`. - -```js run -alert( "Hello, Java!".match(/\bJava\b/) ); // Java -alert( "Hello, JavaScript!".match(/\bJava\b/) ); // null -``` - -The boundary has "zero width" in a sense that usually a character class means a character in the result (like a wordly character or a digit), but not in this case. - -The boundary is a test. - -When regular expression engine is doing the search, it's moving along the string in an attempt to find the match. At each string position it tries to find the pattern. - -When the pattern contains `pattern:\b`, it tests that the position in string is a word boundary, that is one of three variants: - -There are three different positions that qualify as word boundaries: - -- At string start, if the first string character is a word character `\w`. -- Between two characters in the string, where one is a word character `\w` and the other is not. -- At string end, if the last string character is a word character `\w`. - -For instance, in the string `subject:Hello, Java!` the following positions match `\b`: - -![](hello-java-boundaries.svg) - -So it matches `pattern:\bHello\b`, because: - -1. At the beginning of the string the first `\b` test matches. -2. 
Then the word `Hello` matches. -3. Then `\b` matches, as we're between `o` (a word character) and a space (not a word character). - -Pattern `pattern:\bJava\b` also matches. But not `pattern:\bHell\b` (because there's no word boundary after `l`) and not `Java!\b` (because the exclamation sign is not a wordly character, so there's no word boundary after it). - -```js run -alert( "Hello, Java!".match(/\bHello\b/) ); // Hello -alert( "Hello, Java!".match(/\bJava\b/) ); // Java -alert( "Hello, Java!".match(/\bHell\b/) ); // null (no match) -alert( "Hello, Java!".match(/\bJava!\b/) ); // null (no match) -``` - -Once again let's note that `pattern:\b` makes the searching engine to test for the boundary, so that `pattern:Java\b` finds `match:Java` only when followed by a word boundary, but it does not add a letter to the result. - -Usually we use `\b` to find standalone English words. So that if we want `"Java"` language then `pattern:\bJava\b` finds exactly a standalone word and ignores it when it's a part of another word, e.g. it won't match `match:Java` in `subject:JavaScript`. - -Another example: a regexp `pattern:\b\d\d\b` looks for standalone two-digit numbers. In other words, it requires that before and after `pattern:\d\d` must be a symbol different from `\w` (or beginning/end of the string). - -```js run -alert( "1 23 456 78".match(/\b\d\d\b/g) ); // 23,78 -``` - -```warn header="Word boundary doesn't work for non-Latin alphabets" -The word boundary check `\b` tests for a boundary between `\w` and something else. But `\w` means an English letter (or a digit or an underscore), so the test won't work for other characters (like cyrillic or hieroglyphs). - -Later we'll come by Unicode character classes that allow to solve the similar task for different languages. -``` - - -## Inverse classes - -For every character class there exists an "inverse class", denoted with the same letter, but uppercased. 
- -The "reverse" means that it matches all other characters, for instance: - -`\D` -: Non-digit: any character except `\d`, for instance a letter. - -`\S` -: Non-space: any character except `\s`, for instance a letter. - -`\W` -: Non-wordly character: anything but `\w`. - -`\B` -: Non-boundary: a test reverse to `\b`. - -In the beginning of the chapter we saw how to get all digits from the phone `subject:+7(903)-123-45-67`. - -One way was to match all digits and join them: - -```js run -let str = "+7(903)-123-45-67"; - -alert( str.match(/\d/g).join('') ); // 79031234567 -``` - -An alternative, shorter way is to find non-digits `\D` and remove them from the string: - - -```js run -let str = "+7(903)-123-45-67"; - -alert( str.replace(/\D/g, "") ); // 79031234567 -``` - -## Spaces are regular characters - -Usually we pay little attention to spaces. For us strings `subject:1-5` and `subject:1 - 5` are nearly identical. - -But if a regexp doesn't take spaces into account, it may fail to work. - -Let's try to find digits separated by a dash: - -```js run -alert( "1 - 5".match(/\d-\d/) ); // null, no match! -``` - -Here we fix it by adding spaces into the regexp `pattern:\d - \d`: - -```js run -alert( "1 - 5".match(/\d - \d/) ); // 1 - 5, now it works -``` - -**A space is a character. Equal in importance with any other character.** - -Of course, spaces in a regexp are needed only if we look for them. Extra spaces (just like any other extra characters) may prevent a match: - -```js run -alert( "1-5".match(/\d - \d/) ); // null, because the string 1-5 has no spaces -``` - -In other words, in a regular expression all characters matter, spaces too. - -## A dot is any character - -The dot `"."` is a special character class that matches "any character except a newline". 
- -For instance: - -```js run -alert( "Z".match(/./) ); // Z -``` - -Or in the middle of a regexp: - -```js run -let reg = /CS.4/; - -alert( "CSS4".match(reg) ); // CSS4 -alert( "CS-4".match(reg) ); // CS-4 -alert( "CS 4".match(reg) ); // CS 4 (space is also a character) -``` - -Please note that the dot means "any character", but not the "absense of a character". There must be a character to match it: - -```js run -alert( "CS4".match(/CS.4/) ); // null, no match because there's no character for the dot -``` - -### The dotall "s" flag - -Usually a dot doesn't match a newline character. - -For instance, `pattern:A.B` matches `match:A`, and then `match:B` with any character between them, except a newline. - -This doesn't match: - -```js run -alert( "A\nB".match(/A.B/) ); // null (no match) - -// a space character would match, or a letter, but not \n -``` - -Sometimes it's inconvenient, we really want "any character", newline included. - -That's what `s` flag does. If a regexp has it, then the dot `"."` match literally any character: - -```js run -alert( "A\nB".match(/A.B/s) ); // A\nB (match!) -``` - -## Summary - -There exist following character classes: - -- `pattern:\d` -- digits. -- `pattern:\D` -- non-digits. -- `pattern:\s` -- space symbols, tabs, newlines. -- `pattern:\S` -- all but `pattern:\s`. -- `pattern:\w` -- English letters, digits, underscore `'_'`. -- `pattern:\W` -- all but `pattern:\w`. -- `pattern:.` -- any character if with the regexp `'s'` flag, otherwise any except a newline. - -...But that's not all! - -The Unicode encoding, used by JavaScript for strings, provides many properties for characters, like: which language the letter belongs to (if a letter) it is it a punctuation sign, etc. - -Modern JavaScript allows to use these properties in regexps to look for characters, for instance: - -- A cyrillic letter is: `pattern:\p{Script=Cyrillic}` or `pattern:\p{sc=Cyrillic}`. 
-- A dash (be it a small hyphen `-` or a long dash `—`): `pattern:\p{Dash_Punctuation}` or `pattern:\p{pd}`. -- A currency symbol, such as `$`, `€` or another: `pattern:\p{Currency_Symbol}` or `pattern:\p{sc}`. -- ...And much more. Unicode has a lot of character categories that we can select from. - -These patterns require `'u'` regexp flag to work. More about that in the chapter [](info:regexp-unicode). diff --git a/9-regular-expressions/03-regexp-unicode/article.md b/9-regular-expressions/03-regexp-unicode/article.md new file mode 100644 index 00000000..7a14621b --- /dev/null +++ b/9-regular-expressions/03-regexp-unicode/article.md @@ -0,0 +1,167 @@ +# Unicode: flag "u" and class \p{...} + +JavaScript uses [Unicode encoding](https://en.wikipedia.org/wiki/Unicode) for strings. Most characters are encoded with 2 bytes, but that allows representing at most 65536 characters. + +That range is not big enough to encode all possible characters, that's why some rare characters are encoded with 4 bytes, for instance `𝒳` (mathematical X) or `😄` (a smile), some hieroglyphs and so on. + +Here are the unicode values of some characters: + +| Character | Unicode | Bytes count in unicode | +|------------|---------|--------| +| a | `0x0061` | 2 | +| ≈ | `0x2248` | 2 | +|𝒳| `0x1d4b3` | 4 | +|𝒴| `0x1d4b4` | 4 | +|😄| `0x1f604` | 4 | + +So characters like `a` and `≈` occupy 2 bytes, while codes for `𝒳`, `𝒴` and `😄` are longer, they have 4 bytes. + +A long time ago, when the JavaScript language was created, Unicode encoding was simpler: there were no 4-byte characters. So, some language features still handle them incorrectly. + +For instance, `length` thinks that there are two characters: + +```js run +alert('😄'.length); // 2 +alert('𝒳'.length); // 2 +``` + +...But we can see that there's only one, right? The point is that `length` treats 4 bytes as two 2-byte characters. 
That's incorrect, because they must be considered only together (a so-called "surrogate pair"; you can read about them in the article ). + +By default, regular expressions also treat 4-byte "long characters" as a pair of 2-byte ones. And, as it happens with strings, that may lead to odd results. We'll see that a bit later, in the article . + +Unlike strings, regular expressions have flag `pattern:u` that fixes such problems. With this flag, a regexp handles 4-byte characters correctly. Unicode property search also becomes available, we'll get to it next. + +## Unicode properties \p{...} + +```warn header="Not supported in Firefox and Edge" +Despite being a part of the standard since 2018, unicode properties are not supported in Firefox ([bug](https://bugzilla.mozilla.org/show_bug.cgi?id=1361876)) and Edge ([bug](https://github.com/Microsoft/ChakraCore/issues/2969)). + +There's the [XRegExp](http://xregexp.com) library that provides "extended" regular expressions with cross-browser support for unicode properties. +``` + +Every character in Unicode has a lot of properties. They describe what "category" the character belongs to and contain miscellaneous information about it. + +For instance, if a character has `Letter` property, it means that the character belongs to an alphabet (of any language). And `Number` property means that it's a digit: maybe Arabic or Chinese, and so on. + +We can search for characters with a property, written as `pattern:\p{…}`. To use `pattern:\p{…}`, a regular expression must have flag `pattern:u`. + +For instance, `\p{Letter}` denotes a letter in any language. We can also use `\p{L}`, as `L` is an alias of `Letter`. There are shorter aliases for almost every property. + +In the example below three kinds of letters will be found: English, Georgian and Korean. 
+ +```js run +let str = "A ბ ㄱ"; + +alert( str.match(/\p{L}/gu) ); // A,ბ,ㄱ +alert( str.match(/\p{L}/g) ); // null (no matches, as there's no flag "u") +``` + +Here are the main character categories and their subcategories: + +- Letter `L`: + - lowercase `Ll`, + - modifier `Lm`, + - titlecase `Lt`, + - uppercase `Lu`, + - other `Lo`. +- Number `N`: + - decimal digit `Nd`, + - letter number `Nl`, + - other `No`. +- Punctuation `P`: + - connector `Pc`, + - dash `Pd`, + - initial quote `Pi`, + - final quote `Pf`, + - open `Ps`, + - close `Pe`, + - other `Po`. +- Mark `M` (accents etc): + - spacing combining `Mc`, + - enclosing `Me`, + - non-spacing `Mn`. +- Symbol `S`: + - currency `Sc`, + - modifier `Sk`, + - math `Sm`, + - other `So`. +- Separator `Z`: + - line `Zl`, + - paragraph `Zp`, + - space `Zs`. +- Other `C`: + - control `Cc`, + - format `Cf`, + - not assigned `Cn`, + - private use `Co`, + - surrogate `Cs`. + + +So, e.g. if we need letters in lower case, we can write `pattern:\p{Ll}`, punctuation signs: `pattern:\p{P}` and so on. + +There are also other derived categories, like: +- `Alphabetic` (`Alpha`), includes Letters `L`, plus letter numbers `Nl` (e.g. Ⅻ - a character for the Roman numeral 12), plus some other symbols `Other_Alphabetic` (`OAlpha`). +- `Hex_Digit` includes hexadecimal digits: `0-9`, `a-f`. +- ...And so on. + +Unicode supports many different properties, their full list would require a lot of space, so here are the references: + +- List all properties by a character: . +- List all characters by a property: . +- Short aliases for properties: . +- A full base of Unicode characters in text format, with all properties, is here: . + +### Example: hexadecimal numbers + +For instance, let's look for hexadecimal numbers, written as `xFF`, where `F` is a hex digit (0..9 or A..F). 
+ +A hex digit can be denoted as `pattern:\p{Hex_Digit}`: + +```js run +let reg = /x\p{Hex_Digit}\p{Hex_Digit}/u; + +alert("number: xAF".match(reg)); // xAF +``` + +### Example: Chinese hieroglyphs + +Let's look for Chinese hieroglyphs. + +There's a unicode property `Script` (a writing system), that may have a value: `Cyrillic`, `Greek`, `Arabic`, `Han` (Chinese) and so on, [here's the full list](https://en.wikipedia.org/wiki/Script_(Unicode)). + +To look for characters in a given writing system we should use `pattern:Script=`, e.g. for Cyrillic letters: `pattern:\p{sc=Cyrillic}`, for Chinese hieroglyphs: `pattern:\p{sc=Han}`, and so on: + +```js run +let regexp = /\p{sc=Han}/gu; // returns Chinese hieroglyphs + +let str = `Hello Привет 你好 123_456`; + +alert( str.match(regexp) ); // 你,好 +``` + +### Example: currency + +Characters that denote a currency, such as `$`, `€`, `¥`, have unicode property `pattern:\p{Currency_Symbol}`, the short alias: `pattern:\p{Sc}`. + +Let's use it to look for prices in the format "currency, followed by a digit": + +```js run +let regexp = /\p{Sc}\d/gu; + +let str = `Prices: $2, €1, ¥9`; + +alert( str.match(regexp) ); // $2,€1,¥9 +``` + +Later, in the article we'll see how to look for numbers that contain many digits. + +## Summary + +Flag `pattern:u` enables the support of Unicode in regular expressions. + +That means two things: + +1. Characters of 4 bytes are handled correctly: as a single character, not two 2-byte characters. +2. Unicode properties can be used in the search: `\p{…}`. + +With Unicode properties we can look for words in given languages, special characters (quotes, currencies) and so on.
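
The two points of the summary can be tried out together in a minimal sketch. The helper names `findWords` and `isSingleChar` are hypothetical and used only for illustration, they are not part of any API; the sketch also assumes an engine that supports `pattern:\p{…}`.

```js
// Hypothetical helper (illustration only): collects words written in any alphabet.
function findWords(text) {
  // \p{L} is a letter of any language; it is only recognized with the "u" flag
  return text.match(/\p{L}+/gu) || [];
}

// Hypothetical helper: checks that the text is exactly one character,
// counting a 4-byte character (a surrogate pair) as a single one.
function isSingleChar(text) {
  return /^.$/u.test(text);
}

let words = findWords("Hello Привет 你好 123"); // Hello, Привет, 你好
let withFlag = isSingleChar("😄");              // true
let withoutFlag = /^.$/.test("😄");             // false: seen as two 2-byte halves
```

Digits are skipped by `findWords` because they belong to the `Number` category, not `Letter`.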
diff --git a/9-regular-expressions/12-regexp-anchors/1-start-end/solution.md b/9-regular-expressions/04-regexp-anchors/1-start-end/solution.md similarity index 77% rename from 9-regular-expressions/12-regexp-anchors/1-start-end/solution.md rename to 9-regular-expressions/04-regexp-anchors/1-start-end/solution.md index 1a8cbe9a..702f992d 100644 --- a/9-regular-expressions/12-regexp-anchors/1-start-end/solution.md +++ b/9-regular-expressions/04-regexp-anchors/1-start-end/solution.md @@ -1,5 +1,4 @@ - -The empty string is the only match: it starts and immediately finishes. +An empty string is the only match: it starts and immediately finishes. The task once again demonstrates that anchors are not characters, but tests. diff --git a/9-regular-expressions/12-regexp-anchors/1-start-end/task.md b/9-regular-expressions/04-regexp-anchors/1-start-end/task.md similarity index 100% rename from 9-regular-expressions/12-regexp-anchors/1-start-end/task.md rename to 9-regular-expressions/04-regexp-anchors/1-start-end/task.md diff --git a/9-regular-expressions/04-regexp-anchors/article.md b/9-regular-expressions/04-regexp-anchors/article.md new file mode 100644 index 00000000..c34999ee --- /dev/null +++ b/9-regular-expressions/04-regexp-anchors/article.md @@ -0,0 +1,52 @@ +# Anchors: string start ^ and end $ + +The caret `pattern:^` and dollar `pattern:$` characters have special meaning in a regexp. They are called "anchors". + +The caret `pattern:^` matches at the beginning of the text, and the dollar `pattern:$` -- at the end. + +For instance, let's test if the text starts with `Mary`: + +```js run +let str1 = "Mary had a little lamb"; +alert( /^Mary/.test(str1) ); // true +``` + +The pattern `pattern:^Mary` means: "string start and then Mary". 
+ +Similar to this, we can test if the string ends with `snow` using `pattern:snow$`: + +```js run +let str1 = "its fleece was white as snow"; +alert( /snow$/.test(str1) ); // true +``` + +In these particular cases we could use string methods `startsWith/endsWith` instead. Regular expressions should be used for more complex tests. + +## Testing for a full match + +Both anchors together `pattern:^...$` are often used to test whether or not a string fully matches the pattern. For instance, to check if the user input is in the right format. + +Let's check whether or not a string is a time in `12:34` format. That is: two digits, then a colon, and then another two digits. + +In the language of regular expressions that's `pattern:\d\d:\d\d`: + +```js run +let goodInput = "12:34"; +let badInput = "12:345"; + +let regexp = /^\d\d:\d\d$/; +alert( regexp.test(goodInput) ); // true +alert( regexp.test(badInput) ); // false +``` + +Here the match for `pattern:\d\d:\d\d` must start exactly after the beginning of the text `pattern:^`, and the end `pattern:$` must immediately follow. + +The whole string must be exactly in this format. If there's any deviation or an extra character, the result is `false`. + +Anchors behave differently if flag `pattern:m` is present. We'll see that in the next article. + +```smart header="Anchors have \"zero width\"" +Anchors `pattern:^` and `pattern:$` are tests. They have zero width. + +In other words, they do not match a character, but rather force the regexp engine to check the condition (text start/end). +``` diff --git a/9-regular-expressions/05-regexp-multiline-mode/article.md b/9-regular-expressions/05-regexp-multiline-mode/article.md new file mode 100644 index 00000000..321218b3 --- /dev/null +++ b/9-regular-expressions/05-regexp-multiline-mode/article.md @@ -0,0 +1,87 @@ +# Multiline mode of anchors ^ $, flag "m" + +The multiline mode is enabled by the flag `pattern:m`. + +It only affects the behavior of `pattern:^` and `pattern:$`.
+ +In the multiline mode they match not only at the beginning and the end of the string, but also at the start/end of each line. + +## Searching at line start ^ + +In the example below the text has multiple lines. The pattern `pattern:/^\d/gm` takes a digit from the beginning of each line: + +```js run +let str = `1st place: Winnie +2nd place: Piglet +3rd place: Eeyore`; + +*!* +alert( str.match(/^\d/gm) ); // 1,2,3 +*/!* +``` + +Without the flag `pattern:m` only the first digit is matched: + +```js run +let str = `1st place: Winnie +2nd place: Piglet +3rd place: Eeyore`; + +*!* +alert( str.match(/^\d/g) ); // 1 +*/!* +``` + +That's because by default a caret `pattern:^` only matches at the beginning of the text, and in the multiline mode -- at the start of any line. + +```smart +"Start of a line" formally means "immediately after a line break": the test `pattern:^` in multiline mode matches at all positions preceded by a newline character `\n`. + +And at the text start. +``` + +## Searching at line end $ + +The dollar sign `pattern:$` behaves similarly. + +The regular expression `pattern:\d$` finds the last digit in every line: + +```js run +let str = `Winnie: 1 +Piglet: 2 +Eeyore: 3`; + +alert( str.match(/\d$/gm) ); // 1,2,3 +``` + +Without the flag `pattern:m`, the dollar `pattern:$` would only match the end of the whole text, so only the very last digit would be found. + +```smart +"End of a line" formally means "immediately before a line break": the test `pattern:$` in multiline mode matches at all positions followed by a newline character `\n`. + +And at the text end. +``` + +## Searching for \n instead of ^ $ + +To find a newline, we can use not only anchors `pattern:^` and `pattern:$`, but also the newline character `\n`. + +What's the difference? Let's see an example.
+ +Here we search for `pattern:\d\n` instead of `pattern:\d$`: + +```js run +let str = `Winnie: 1 +Piglet: 2 +Eeyore: 3`; + +alert( str.match(/\d\n/gm) ); // 1\n,2\n +``` + +As we can see, there are 2 matches instead of 3. + +That's because there's no newline after `subject:3` (there's text end though, so it matches `pattern:$`). + +Another difference: now every match includes a newline character `match:\n`. Unlike the anchors `pattern:^` `pattern:$`, that only test the condition (start/end of a line), `\n` is a character, so it becomes a part of the result. + +So, a `\n` in the pattern is used when we need newline characters in the result, while anchors are used to find something at the beginning/end of a line. diff --git a/9-regular-expressions/06-regexp-boundary/1-find-time-hh-mm/solution.md b/9-regular-expressions/06-regexp-boundary/1-find-time-hh-mm/solution.md new file mode 100644 index 00000000..d378d4c9 --- /dev/null +++ b/9-regular-expressions/06-regexp-boundary/1-find-time-hh-mm/solution.md @@ -0,0 +1,6 @@ + +The answer: `pattern:\b\d\d:\d\d\b`. + +```js run +alert( "Breakfast at 09:00 in the room 123:456.".match( /\b\d\d:\d\d\b/ ) ); // 09:00 +``` diff --git a/9-regular-expressions/06-regexp-boundary/1-find-time-hh-mm/task.md b/9-regular-expressions/06-regexp-boundary/1-find-time-hh-mm/task.md new file mode 100644 index 00000000..16330a6d --- /dev/null +++ b/9-regular-expressions/06-regexp-boundary/1-find-time-hh-mm/task.md @@ -0,0 +1,9 @@ +# Find the time + +The time has a format: `hours:minutes`. Both hours and minutes have two digits, like `09:00`. + +Write a regexp to find the time in the string: `subject:Breakfast at 09:00 in the room 123:456.` + +P.S. In this task there's no need to check time correctness yet, so `25:99` can also be a valid result. + +P.P.S. The regexp shouldn't match `123:456`.
diff --git a/9-regular-expressions/06-regexp-boundary/article.md b/9-regular-expressions/06-regexp-boundary/article.md new file mode 100644 index 00000000..286a963e --- /dev/null +++ b/9-regular-expressions/06-regexp-boundary/article.md @@ -0,0 +1,53 @@ +# Word boundary: \b + +A word boundary `pattern:\b` is a test, just like `pattern:^` and `pattern:$`. + +When the regexp engine (program module that implements searching for regexps) comes across `pattern:\b`, it checks that the position in the string is a word boundary. + +There are three different positions that qualify as word boundaries: + +- At string start, if the first string character is a word character `pattern:\w`. +- Between two characters in the string, where one is a word character `pattern:\w` and the other is not. +- At string end, if the last string character is a word character `pattern:\w`. + +For instance, regexp `pattern:\bJava\b` will be found in `subject:Hello, Java!`, where `subject:Java` is a standalone word, but not in `subject:Hello, JavaScript!`. + +```js run +alert( "Hello, Java!".match(/\bJava\b/) ); // Java +alert( "Hello, JavaScript!".match(/\bJava\b/) ); // null +``` + +In the string `subject:Hello, Java!` the following positions correspond to `pattern:\b`: + +![](hello-java-boundaries.svg) + +So, it matches the pattern `pattern:\bHello\b`, because: + +1. The first test `pattern:\b` matches at the beginning of the string. +2. Then the word `pattern:Hello` matches. +3. Then the test `pattern:\b` matches again, as we're between `subject:o` and a comma. + +The pattern `pattern:\bJava\b` would also match. But not `pattern:\bHell\b` (because there's no word boundary after `subject:l`), and not `pattern:Java!\b` (because the exclamation sign is not a word character `pattern:\w`, so there's no word boundary after it).
+ +```js run +alert( "Hello, Java!".match(/\bHello\b/) ); // Hello +alert( "Hello, Java!".match(/\bJava\b/) ); // Java +alert( "Hello, Java!".match(/\bHell\b/) ); // null (no match) +alert( "Hello, Java!".match(/\bJava!\b/) ); // null (no match) +``` + +Since `pattern:\b` is a test, it does not add the character after the boundary to the result. + +We can use `pattern:\b` not only with words, but with digits as well. + +For example, the pattern `pattern:\b\d\d\b` looks for standalone 2-digit numbers. In other words, it requires that before and after `pattern:\d\d` there is a character other than `pattern:\w` (or the string start/end). + +```js run +alert( "1 23 456 78".match(/\b\d\d\b/g) ); // 23,78 +``` + +```warn header="Word boundary `pattern:\b` doesn't work for non-Latin alphabets" +The word boundary test `pattern:\b` checks that there is `pattern:\w` on one side of the position and "not `pattern:\w`" on the other. + +But `pattern:\w` means a Latin letter (or a digit or an underscore), so the test doesn't work for other characters (e.g. Cyrillic letters or hieroglyphs).
+``` diff --git a/9-regular-expressions/03-regexp-character-classes/hello-java-boundaries.svg b/9-regular-expressions/06-regexp-boundary/hello-java-boundaries.svg similarity index 100% rename from 9-regular-expressions/03-regexp-character-classes/hello-java-boundaries.svg rename to 9-regular-expressions/06-regexp-boundary/hello-java-boundaries.svg diff --git a/9-regular-expressions/04-regexp-escaping/article.md b/9-regular-expressions/07-regexp-escaping/article.md similarity index 96% rename from 9-regular-expressions/04-regexp-escaping/article.md rename to 9-regular-expressions/07-regexp-escaping/article.md index 909cd485..cd118010 100644 --- a/9-regular-expressions/04-regexp-escaping/article.md +++ b/9-regular-expressions/07-regexp-escaping/article.md @@ -75,7 +75,7 @@ The quotes "consume" backslashes and interpret them, for instance: - `\n` -- becomes a newline character, - `\u1234` -- becomes the Unicode character with such code, -- ...And when there's no special meaning: like `\d` or `\z`, then the backslash is simply removed. +- ...And when there's no special meaning: like `pattern:\d` or `\z`, then the backslash is simply removed. So the call to `new RegExp` gets a string without backslashes. That's why the search doesn't work! 
diff --git a/9-regular-expressions/05-regexp-character-sets-and-ranges/1-find-range-1/solution.md b/9-regular-expressions/08-regexp-character-sets-and-ranges/1-find-range-1/solution.md similarity index 100% rename from 9-regular-expressions/05-regexp-character-sets-and-ranges/1-find-range-1/solution.md rename to 9-regular-expressions/08-regexp-character-sets-and-ranges/1-find-range-1/solution.md diff --git a/9-regular-expressions/05-regexp-character-sets-and-ranges/1-find-range-1/task.md b/9-regular-expressions/08-regexp-character-sets-and-ranges/1-find-range-1/task.md similarity index 100% rename from 9-regular-expressions/05-regexp-character-sets-and-ranges/1-find-range-1/task.md rename to 9-regular-expressions/08-regexp-character-sets-and-ranges/1-find-range-1/task.md diff --git a/9-regular-expressions/05-regexp-character-sets-and-ranges/2-find-time-2-formats/solution.md b/9-regular-expressions/08-regexp-character-sets-and-ranges/2-find-time-2-formats/solution.md similarity index 100% rename from 9-regular-expressions/05-regexp-character-sets-and-ranges/2-find-time-2-formats/solution.md rename to 9-regular-expressions/08-regexp-character-sets-and-ranges/2-find-time-2-formats/solution.md diff --git a/9-regular-expressions/05-regexp-character-sets-and-ranges/2-find-time-2-formats/task.md b/9-regular-expressions/08-regexp-character-sets-and-ranges/2-find-time-2-formats/task.md similarity index 100% rename from 9-regular-expressions/05-regexp-character-sets-and-ranges/2-find-time-2-formats/task.md rename to 9-regular-expressions/08-regexp-character-sets-and-ranges/2-find-time-2-formats/task.md diff --git a/9-regular-expressions/05-regexp-character-sets-and-ranges/article.md b/9-regular-expressions/08-regexp-character-sets-and-ranges/article.md similarity index 97% rename from 9-regular-expressions/05-regexp-character-sets-and-ranges/article.md rename to 9-regular-expressions/08-regexp-character-sets-and-ranges/article.md index 7204f2b1..3a94125c 100644 --- 
a/9-regular-expressions/05-regexp-character-sets-and-ranges/article.md +++ b/9-regular-expressions/08-regexp-character-sets-and-ranges/article.md @@ -44,7 +44,7 @@ alert( "Exception 0xAF".match(/x[0-9A-F][0-9A-F]/g) ); // xAF Please note that in the word `subject:Exception` there's a substring `subject:xce`. It didn't match the pattern, because the letters are lowercase, while in the set `pattern:[0-9A-F]` they are uppercase. -If we want to find it too, then we can add a range `a-f`: `pattern:[0-9A-Fa-f]`. The `i` flag would allow lowercase too. +If we want to find it too, then we can add a range `a-f`: `pattern:[0-9A-Fa-f]`. The `pattern:i` flag would allow lowercase too. **Character classes are shorthands for certain character sets.** @@ -58,7 +58,7 @@ We can use character classes inside `[…]` as well. For instance, we want to match all wordly characters or a dash, for words like "twenty-third". We can't do it with `pattern:\w+`, because `pattern:\w` class does not include a dash. But we can use `pattern:[\w-]`. -We also can use several classes, for example `pattern:[\s\S]` matches spaces or non-spaces -- any character. That's wider than a dot `"."`, because the dot matches any character except a newline (unless `s` flag is set). +We also can use several classes, for example `pattern:[\s\S]` matches spaces or non-spaces -- any character. That's wider than a dot `"."`, because the dot matches any character except a newline (unless `pattern:s` flag is set). ## Excluding ranges @@ -69,7 +69,7 @@ They are denoted by a caret character `^` at the start and match any character * For instance: - `pattern:[^aeyo]` -- any character except `'a'`, `'e'`, `'y'` or `'o'`. -- `pattern:[^0-9]` -- any character except a digit, the same as `\D`. +- `pattern:[^0-9]` -- any character except a digit, the same as `pattern:\D`. - `pattern:[^\s]` -- any non-space character, same as `\S`. 
The example below looks for any characters except letters, digits and spaces: diff --git a/9-regular-expressions/07-regexp-quantifiers/1-find-text-manydots/solution.md b/9-regular-expressions/09-regexp-quantifiers/1-find-text-manydots/solution.md similarity index 100% rename from 9-regular-expressions/07-regexp-quantifiers/1-find-text-manydots/solution.md rename to 9-regular-expressions/09-regexp-quantifiers/1-find-text-manydots/solution.md diff --git a/9-regular-expressions/07-regexp-quantifiers/1-find-text-manydots/task.md b/9-regular-expressions/09-regexp-quantifiers/1-find-text-manydots/task.md similarity index 100% rename from 9-regular-expressions/07-regexp-quantifiers/1-find-text-manydots/task.md rename to 9-regular-expressions/09-regexp-quantifiers/1-find-text-manydots/task.md diff --git a/9-regular-expressions/07-regexp-quantifiers/2-find-html-colors-6hex/solution.md b/9-regular-expressions/09-regexp-quantifiers/2-find-html-colors-6hex/solution.md similarity index 91% rename from 9-regular-expressions/07-regexp-quantifiers/2-find-html-colors-6hex/solution.md rename to 9-regular-expressions/09-regexp-quantifiers/2-find-html-colors-6hex/solution.md index 4e85285b..d4d297a1 100644 --- a/9-regular-expressions/07-regexp-quantifiers/2-find-html-colors-6hex/solution.md +++ b/9-regular-expressions/09-regexp-quantifiers/2-find-html-colors-6hex/solution.md @@ -1,6 +1,6 @@ We need to look for `#` followed by 6 hexadecimal characters. -A hexadecimal character can be described as `pattern:[0-9a-fA-F]`. Or if we use the `i` flag, then just `pattern:[0-9a-f]`. +A hexadecimal character can be described as `pattern:[0-9a-fA-F]`. Or if we use the `pattern:i` flag, then just `pattern:[0-9a-f]`. Then we can look for 6 of them using the quantifier `pattern:{6}`. 
diff --git a/9-regular-expressions/07-regexp-quantifiers/2-find-html-colors-6hex/task.md b/9-regular-expressions/09-regexp-quantifiers/2-find-html-colors-6hex/task.md similarity index 100% rename from 9-regular-expressions/07-regexp-quantifiers/2-find-html-colors-6hex/task.md rename to 9-regular-expressions/09-regexp-quantifiers/2-find-html-colors-6hex/task.md diff --git a/9-regular-expressions/07-regexp-quantifiers/article.md b/9-regular-expressions/09-regexp-quantifiers/article.md similarity index 97% rename from 9-regular-expressions/07-regexp-quantifiers/article.md rename to 9-regular-expressions/09-regexp-quantifiers/article.md index 7f382dcc..9b70d722 100644 --- a/9-regular-expressions/07-regexp-quantifiers/article.md +++ b/9-regular-expressions/09-regexp-quantifiers/article.md @@ -2,7 +2,7 @@ Let's say we have a string like `+7(903)-123-45-67` and want to find all numbers in it. But unlike before, we are interested not in single digits, but full numbers: `7, 903, 123, 45, 67`. -A number is a sequence of 1 or more digits `\d`. To mark how many we need, we need to append a *quantifier*. +A number is a sequence of 1 or more digits `pattern:\d`. To mark how many we need, we need to append a *quantifier*. 
## Quantity {n} diff --git a/9-regular-expressions/08-regexp-greedy-and-lazy/1-lazy-greedy/solution.md b/9-regular-expressions/10-regexp-greedy-and-lazy/1-lazy-greedy/solution.md similarity index 100% rename from 9-regular-expressions/08-regexp-greedy-and-lazy/1-lazy-greedy/solution.md rename to 9-regular-expressions/10-regexp-greedy-and-lazy/1-lazy-greedy/solution.md diff --git a/9-regular-expressions/08-regexp-greedy-and-lazy/1-lazy-greedy/task.md b/9-regular-expressions/10-regexp-greedy-and-lazy/1-lazy-greedy/task.md similarity index 100% rename from 9-regular-expressions/08-regexp-greedy-and-lazy/1-lazy-greedy/task.md rename to 9-regular-expressions/10-regexp-greedy-and-lazy/1-lazy-greedy/task.md diff --git a/9-regular-expressions/08-regexp-greedy-and-lazy/3-find-html-comments/solution.md b/9-regular-expressions/10-regexp-greedy-and-lazy/3-find-html-comments/solution.md similarity index 100% rename from 9-regular-expressions/08-regexp-greedy-and-lazy/3-find-html-comments/solution.md rename to 9-regular-expressions/10-regexp-greedy-and-lazy/3-find-html-comments/solution.md diff --git a/9-regular-expressions/08-regexp-greedy-and-lazy/3-find-html-comments/task.md b/9-regular-expressions/10-regexp-greedy-and-lazy/3-find-html-comments/task.md similarity index 100% rename from 9-regular-expressions/08-regexp-greedy-and-lazy/3-find-html-comments/task.md rename to 9-regular-expressions/10-regexp-greedy-and-lazy/3-find-html-comments/task.md diff --git a/9-regular-expressions/08-regexp-greedy-and-lazy/4-find-html-tags-greedy-lazy/solution.md b/9-regular-expressions/10-regexp-greedy-and-lazy/4-find-html-tags-greedy-lazy/solution.md similarity index 100% rename from 9-regular-expressions/08-regexp-greedy-and-lazy/4-find-html-tags-greedy-lazy/solution.md rename to 9-regular-expressions/10-regexp-greedy-and-lazy/4-find-html-tags-greedy-lazy/solution.md diff --git a/9-regular-expressions/08-regexp-greedy-and-lazy/4-find-html-tags-greedy-lazy/task.md 
b/9-regular-expressions/10-regexp-greedy-and-lazy/4-find-html-tags-greedy-lazy/task.md similarity index 100% rename from 9-regular-expressions/08-regexp-greedy-and-lazy/4-find-html-tags-greedy-lazy/task.md rename to 9-regular-expressions/10-regexp-greedy-and-lazy/4-find-html-tags-greedy-lazy/task.md diff --git a/9-regular-expressions/08-regexp-greedy-and-lazy/article.md b/9-regular-expressions/10-regexp-greedy-and-lazy/article.md similarity index 100% rename from 9-regular-expressions/08-regexp-greedy-and-lazy/article.md rename to 9-regular-expressions/10-regexp-greedy-and-lazy/article.md diff --git a/9-regular-expressions/08-regexp-greedy-and-lazy/witch_greedy1.svg b/9-regular-expressions/10-regexp-greedy-and-lazy/witch_greedy1.svg similarity index 100% rename from 9-regular-expressions/08-regexp-greedy-and-lazy/witch_greedy1.svg rename to 9-regular-expressions/10-regexp-greedy-and-lazy/witch_greedy1.svg diff --git a/9-regular-expressions/08-regexp-greedy-and-lazy/witch_greedy2.svg b/9-regular-expressions/10-regexp-greedy-and-lazy/witch_greedy2.svg similarity index 100% rename from 9-regular-expressions/08-regexp-greedy-and-lazy/witch_greedy2.svg rename to 9-regular-expressions/10-regexp-greedy-and-lazy/witch_greedy2.svg diff --git a/9-regular-expressions/08-regexp-greedy-and-lazy/witch_greedy3.svg b/9-regular-expressions/10-regexp-greedy-and-lazy/witch_greedy3.svg similarity index 100% rename from 9-regular-expressions/08-regexp-greedy-and-lazy/witch_greedy3.svg rename to 9-regular-expressions/10-regexp-greedy-and-lazy/witch_greedy3.svg diff --git a/9-regular-expressions/08-regexp-greedy-and-lazy/witch_greedy4.svg b/9-regular-expressions/10-regexp-greedy-and-lazy/witch_greedy4.svg similarity index 100% rename from 9-regular-expressions/08-regexp-greedy-and-lazy/witch_greedy4.svg rename to 9-regular-expressions/10-regexp-greedy-and-lazy/witch_greedy4.svg diff --git a/9-regular-expressions/08-regexp-greedy-and-lazy/witch_greedy5.svg 
b/9-regular-expressions/10-regexp-greedy-and-lazy/witch_greedy5.svg similarity index 100% rename from 9-regular-expressions/08-regexp-greedy-and-lazy/witch_greedy5.svg rename to 9-regular-expressions/10-regexp-greedy-and-lazy/witch_greedy5.svg diff --git a/9-regular-expressions/08-regexp-greedy-and-lazy/witch_greedy6.svg b/9-regular-expressions/10-regexp-greedy-and-lazy/witch_greedy6.svg similarity index 100% rename from 9-regular-expressions/08-regexp-greedy-and-lazy/witch_greedy6.svg rename to 9-regular-expressions/10-regexp-greedy-and-lazy/witch_greedy6.svg diff --git a/9-regular-expressions/08-regexp-greedy-and-lazy/witch_lazy3.svg b/9-regular-expressions/10-regexp-greedy-and-lazy/witch_lazy3.svg similarity index 100% rename from 9-regular-expressions/08-regexp-greedy-and-lazy/witch_lazy3.svg rename to 9-regular-expressions/10-regexp-greedy-and-lazy/witch_lazy3.svg diff --git a/9-regular-expressions/08-regexp-greedy-and-lazy/witch_lazy4.svg b/9-regular-expressions/10-regexp-greedy-and-lazy/witch_lazy4.svg similarity index 100% rename from 9-regular-expressions/08-regexp-greedy-and-lazy/witch_lazy4.svg rename to 9-regular-expressions/10-regexp-greedy-and-lazy/witch_lazy4.svg diff --git a/9-regular-expressions/08-regexp-greedy-and-lazy/witch_lazy5.svg b/9-regular-expressions/10-regexp-greedy-and-lazy/witch_lazy5.svg similarity index 100% rename from 9-regular-expressions/08-regexp-greedy-and-lazy/witch_lazy5.svg rename to 9-regular-expressions/10-regexp-greedy-and-lazy/witch_lazy5.svg diff --git a/9-regular-expressions/08-regexp-greedy-and-lazy/witch_lazy6.svg b/9-regular-expressions/10-regexp-greedy-and-lazy/witch_lazy6.svg similarity index 100% rename from 9-regular-expressions/08-regexp-greedy-and-lazy/witch_lazy6.svg rename to 9-regular-expressions/10-regexp-greedy-and-lazy/witch_lazy6.svg diff --git a/9-regular-expressions/09-regexp-groups/1-find-webcolor-3-or-6/solution.md b/9-regular-expressions/11-regexp-groups/1-find-webcolor-3-or-6/solution.md 
similarity index 100% rename from 9-regular-expressions/09-regexp-groups/1-find-webcolor-3-or-6/solution.md rename to 9-regular-expressions/11-regexp-groups/1-find-webcolor-3-or-6/solution.md diff --git a/9-regular-expressions/09-regexp-groups/1-find-webcolor-3-or-6/task.md b/9-regular-expressions/11-regexp-groups/1-find-webcolor-3-or-6/task.md similarity index 100% rename from 9-regular-expressions/09-regexp-groups/1-find-webcolor-3-or-6/task.md rename to 9-regular-expressions/11-regexp-groups/1-find-webcolor-3-or-6/task.md diff --git a/9-regular-expressions/09-regexp-groups/2-find-decimal-numbers/solution.md b/9-regular-expressions/11-regexp-groups/2-find-decimal-numbers/solution.md similarity index 100% rename from 9-regular-expressions/09-regexp-groups/2-find-decimal-numbers/solution.md rename to 9-regular-expressions/11-regexp-groups/2-find-decimal-numbers/solution.md diff --git a/9-regular-expressions/09-regexp-groups/2-find-decimal-numbers/task.md b/9-regular-expressions/11-regexp-groups/2-find-decimal-numbers/task.md similarity index 100% rename from 9-regular-expressions/09-regexp-groups/2-find-decimal-numbers/task.md rename to 9-regular-expressions/11-regexp-groups/2-find-decimal-numbers/task.md diff --git a/9-regular-expressions/09-regexp-groups/5-parse-expression/solution.md b/9-regular-expressions/11-regexp-groups/5-parse-expression/solution.md similarity index 100% rename from 9-regular-expressions/09-regexp-groups/5-parse-expression/solution.md rename to 9-regular-expressions/11-regexp-groups/5-parse-expression/solution.md diff --git a/9-regular-expressions/09-regexp-groups/5-parse-expression/task.md b/9-regular-expressions/11-regexp-groups/5-parse-expression/task.md similarity index 100% rename from 9-regular-expressions/09-regexp-groups/5-parse-expression/task.md rename to 9-regular-expressions/11-regexp-groups/5-parse-expression/task.md diff --git a/9-regular-expressions/09-regexp-groups/article.md 
b/9-regular-expressions/11-regexp-groups/article.md similarity index 100% rename from 9-regular-expressions/09-regexp-groups/article.md rename to 9-regular-expressions/11-regexp-groups/article.md diff --git a/9-regular-expressions/09-regexp-groups/regexp-nested-groups.svg b/9-regular-expressions/11-regexp-groups/regexp-nested-groups.svg similarity index 100% rename from 9-regular-expressions/09-regexp-groups/regexp-nested-groups.svg rename to 9-regular-expressions/11-regexp-groups/regexp-nested-groups.svg diff --git a/9-regular-expressions/12-regexp-anchors/2-test-mac/solution.md b/9-regular-expressions/12-regexp-anchors/2-test-mac/solution.md deleted file mode 100644 index 422bc65e..00000000 --- a/9-regular-expressions/12-regexp-anchors/2-test-mac/solution.md +++ /dev/null @@ -1,21 +0,0 @@ -A two-digit hex number is `pattern:[0-9a-f]{2}` (assuming the `pattern:i` flag is enabled). - -We need that number `NN`, and then `:NN` repeated 5 times (more numbers); - -The regexp is: `pattern:[0-9a-f]{2}(:[0-9a-f]{2}){5}` - -Now let's show that the match should capture all the text: start at the beginning and end at the end. That's done by wrapping the pattern in `pattern:^...$`. - -Finally: - -```js run -let reg = /^[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}$/i; - -alert( reg.test('01:32:54:67:89:AB') ); // true - -alert( reg.test('0132546789AB') ); // false (no colons) - -alert( reg.test('01:32:54:67:89') ); // false (5 numbers, need 6) - -alert( reg.test('01:32:54:67:89:ZZ') ) // false (ZZ in the end) -``` diff --git a/9-regular-expressions/12-regexp-anchors/2-test-mac/task.md b/9-regular-expressions/12-regexp-anchors/2-test-mac/task.md deleted file mode 100644 index e7265598..00000000 --- a/9-regular-expressions/12-regexp-anchors/2-test-mac/task.md +++ /dev/null @@ -1,20 +0,0 @@ -# Check MAC-address - -[MAC-address](https://en.wikipedia.org/wiki/MAC_address) of a network interface consists of 6 two-digit hex numbers separated by a colon. 
- -For instance: `subject:'01:32:54:67:89:AB'`. - -Write a regexp that checks whether a string is MAC-address. - -Usage: -```js -let reg = /your regexp/; - -alert( reg.test('01:32:54:67:89:AB') ); // true - -alert( reg.test('0132546789AB') ); // false (no colons) - -alert( reg.test('01:32:54:67:89') ); // false (5 numbers, must be 6) - -alert( reg.test('01:32:54:67:89:ZZ') ) // false (ZZ ad the end) -``` diff --git a/9-regular-expressions/12-regexp-anchors/article.md b/9-regular-expressions/12-regexp-anchors/article.md deleted file mode 100644 index 0c2dd578..00000000 --- a/9-regular-expressions/12-regexp-anchors/article.md +++ /dev/null @@ -1,55 +0,0 @@ -# String start ^ and finish $ - -The caret `pattern:'^'` and dollar `pattern:'$'` characters have special meaning in a regexp. They are called "anchors". - -The caret `pattern:^` matches at the beginning of the text, and the dollar `pattern:$` -- in the end. - -For instance, let's test if the text starts with `Mary`: - -```js run -let str1 = "Mary had a little lamb, it's fleece was white as snow"; -let str2 = 'Everywhere Mary went, the lamp was sure to go'; - -alert( /^Mary/.test(str1) ); // true -alert( /^Mary/.test(str2) ); // false -``` - -The pattern `pattern:^Mary` means: "the string start and then Mary". - -Now let's test whether the text ends with an email. - -To match an email, we can use a regexp `pattern:[-.\w]+@([\w-]+\.)+[\w-]{2,20}`. - -To test whether the string ends with the email, let's add `pattern:$` to the pattern: - -```js run -let reg = /[-.\w]+@([\w-]+\.)+[\w-]{2,20}$/g; - -let str1 = 'My email is mail@site.com'; -let str2 = 'Everywhere Mary went, the lamp was sure to go'; - -alert( reg.test(str1) ); // true -alert( reg.test(str2) ); // false -``` - -We can use both anchors together to check whether the string exactly follows the pattern. That's often used for validation. - -For instance we want to check that `str` is exactly a color in the form `#` plus 6 hex digits. 
The pattern for the color is `pattern:#[0-9a-f]{6}`. - -To check that the *whole string* exactly matches it, we add `pattern:^...$`: - -```js run -let str = "#abcdef"; - -alert( /^#[0-9a-f]{6}$/i.test(str) ); // true -``` - -The regexp engine looks for the text start, then the color, and then immediately the text end. Just what we need. - -```smart header="Anchors have zero length" -Anchors just like `\b` are tests. They have zero-width. - -In other words, they do not match a character, but rather force the regexp engine to check the condition (text start/end). -``` - -The behavior of anchors changes if there's a flag `pattern:m` (multiline mode). We'll explore it in the next chapter. diff --git a/9-regular-expressions/10-regexp-backreferences/article.md b/9-regular-expressions/12-regexp-backreferences/article.md similarity index 100% rename from 9-regular-expressions/10-regexp-backreferences/article.md rename to 9-regular-expressions/12-regexp-backreferences/article.md diff --git a/9-regular-expressions/11-regexp-alternation/01-find-programming-language/solution.md b/9-regular-expressions/13-regexp-alternation/01-find-programming-language/solution.md similarity index 100% rename from 9-regular-expressions/11-regexp-alternation/01-find-programming-language/solution.md rename to 9-regular-expressions/13-regexp-alternation/01-find-programming-language/solution.md diff --git a/9-regular-expressions/11-regexp-alternation/01-find-programming-language/task.md b/9-regular-expressions/13-regexp-alternation/01-find-programming-language/task.md similarity index 100% rename from 9-regular-expressions/11-regexp-alternation/01-find-programming-language/task.md rename to 9-regular-expressions/13-regexp-alternation/01-find-programming-language/task.md diff --git a/9-regular-expressions/11-regexp-alternation/02-find-matching-bbtags/solution.md b/9-regular-expressions/13-regexp-alternation/02-find-matching-bbtags/solution.md similarity index 79% rename from 
9-regular-expressions/11-regexp-alternation/02-find-matching-bbtags/solution.md rename to 9-regular-expressions/13-regexp-alternation/02-find-matching-bbtags/solution.md index e448a4b1..dddaf962 100644 --- a/9-regular-expressions/11-regexp-alternation/02-find-matching-bbtags/solution.md +++ b/9-regular-expressions/13-regexp-alternation/02-find-matching-bbtags/solution.md @@ -1,7 +1,7 @@ Opening tag is `pattern:\[(b|url|quote)\]`. -Then to find everything till the closing tag -- let's use the pattern `pattern:.*?` with flag `s` to match any character including the newline and then add a backreference to the closing tag. +Then to find everything till the closing tag -- let's use the pattern `pattern:.*?` with flag `pattern:s` to match any character including the newline and then add a backreference to the closing tag. The full pattern: `pattern:\[(b|url|quote)\].*?\[/\1\]`. diff --git a/9-regular-expressions/11-regexp-alternation/02-find-matching-bbtags/task.md b/9-regular-expressions/13-regexp-alternation/02-find-matching-bbtags/task.md similarity index 100% rename from 9-regular-expressions/11-regexp-alternation/02-find-matching-bbtags/task.md rename to 9-regular-expressions/13-regexp-alternation/02-find-matching-bbtags/task.md diff --git a/9-regular-expressions/11-regexp-alternation/03-match-quoted-string/solution.md b/9-regular-expressions/13-regexp-alternation/03-match-quoted-string/solution.md similarity index 100% rename from 9-regular-expressions/11-regexp-alternation/03-match-quoted-string/solution.md rename to 9-regular-expressions/13-regexp-alternation/03-match-quoted-string/solution.md diff --git a/9-regular-expressions/11-regexp-alternation/03-match-quoted-string/task.md b/9-regular-expressions/13-regexp-alternation/03-match-quoted-string/task.md similarity index 100% rename from 9-regular-expressions/11-regexp-alternation/03-match-quoted-string/task.md rename to 9-regular-expressions/13-regexp-alternation/03-match-quoted-string/task.md diff --git 
a/9-regular-expressions/11-regexp-alternation/04-match-exact-tag/solution.md b/9-regular-expressions/13-regexp-alternation/04-match-exact-tag/solution.md similarity index 100% rename from 9-regular-expressions/11-regexp-alternation/04-match-exact-tag/solution.md rename to 9-regular-expressions/13-regexp-alternation/04-match-exact-tag/solution.md diff --git a/9-regular-expressions/11-regexp-alternation/04-match-exact-tag/task.md b/9-regular-expressions/13-regexp-alternation/04-match-exact-tag/task.md similarity index 100% rename from 9-regular-expressions/11-regexp-alternation/04-match-exact-tag/task.md rename to 9-regular-expressions/13-regexp-alternation/04-match-exact-tag/task.md diff --git a/9-regular-expressions/11-regexp-alternation/article.md b/9-regular-expressions/13-regexp-alternation/article.md similarity index 100% rename from 9-regular-expressions/11-regexp-alternation/article.md rename to 9-regular-expressions/13-regexp-alternation/article.md diff --git a/9-regular-expressions/13-regexp-multiline-mode/article.md b/9-regular-expressions/13-regexp-multiline-mode/article.md deleted file mode 100644 index 955d9601..00000000 --- a/9-regular-expressions/13-regexp-multiline-mode/article.md +++ /dev/null @@ -1,75 +0,0 @@ -# Multiline mode, flag "m" - -The multiline mode is enabled by the flag `pattern:/.../m`. - -It only affects the behavior of `pattern:^` and `pattern:$`. - -In the multiline mode they match not only at the beginning and end of the string, but also at start/end of line. - -## Line start ^ - -In the example below the text has multiple lines. The pattern `pattern:/^\d+/gm` takes a number from the beginning of each one: - -```js run -let str = `1st place: Winnie -2nd place: Piglet -33rd place: Eeyore`; - -*!* -alert( str.match(/^\d+/gm) ); // 1, 2, 33 -*/!* -``` - -The regexp engine moves along the text and looks for a line start `pattern:^`, when finds -- continues to match the rest of the pattern `pattern:\d+`. 
- -Without the flag `pattern:/.../m` only the first number is matched: - -```js run -let str = `1st place: Winnie -2nd place: Piglet -33rd place: Eeyore`; - -*!* -alert( str.match(/^\d+/g) ); // 1 -*/!* -``` - -That's because by default a caret `pattern:^` only matches at the beginning of the text, and in the multiline mode -- at the start of any line. - -## Line end $ - -The dollar sign `pattern:$` behaves similarly. - -The regular expression `pattern:\w+$` finds the last word in every line - -```js run -let str = `1st place: Winnie -2nd place: Piglet -33rd place: Eeyore`; - -alert( str.match(/\w+$/gim) ); // Winnie,Piglet,Eeyore -``` - -Without the `pattern:/.../m` flag the dollar `pattern:$` would only match the end of the whole string, so only the very last word would be found. - -## Anchors ^$ versus \n - -To find a newline, we can use not only `pattern:^` and `pattern:$`, but also the newline character `\n`. - -The first difference is that unlike anchors, the character `\n` "consumes" the newline character and adds it to the result. - -For instance, here we use it instead of `pattern:$`: - -```js run -let str = `1st place: Winnie -2nd place: Piglet -33rd place: Eeyore`; - -alert( str.match(/\w+\n/gim) ); // Winnie\n,Piglet\n -``` - -Here every match is a word plus a newline character. - -And one more difference -- the newline `\n` does not match at the string end. That's why `Eeyore` is not found in the example above. - -So, anchors are usually better, they are closer to what we want to get. 
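The difference between the anchors under the `m` flag and a literal `\n` can be checked side by side. The following is a minimal sketch, not part of the original article: it reuses the sample text from the examples above, and the variable names are assumptions chosen for illustration.

```js
// Sample text reused from the examples in the article
const str = `1st place: Winnie
2nd place: Piglet
33rd place: Eeyore`;

// With "m", ^ matches at the start of every line
const numbersMultiline = str.match(/^\d+/gm);

// Without "m", ^ matches only at the very start of the text
const numbersDefault = str.match(/^\d+/g);

// $ with "m" is a zero-width test, so the newline is not part of the match
const wordsWithAnchor = str.match(/\w+$/gm);

// \n consumes the newline, and the last line has no newline to consume
const wordsWithNewline = str.match(/\w+\n/g);

console.log(numbersMultiline); // [ '1', '2', '33' ]
console.log(numbersDefault);   // [ '1' ]
console.log(wordsWithAnchor);  // [ 'Winnie', 'Piglet', 'Eeyore' ]
console.log(wordsWithNewline); // [ 'Winnie\n', 'Piglet\n' ]
```

The last two results show why anchors are usually the better fit: the matches contain only the word, and the final line is not skipped.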
diff --git a/9-regular-expressions/14-regexp-lookahead-lookbehind/article.md b/9-regular-expressions/14-regexp-lookahead-lookbehind/article.md index e877cae4..8e36fb0b 100644 --- a/9-regular-expressions/14-regexp-lookahead-lookbehind/article.md +++ b/9-regular-expressions/14-regexp-lookahead-lookbehind/article.md @@ -101,9 +101,9 @@ Lookaround types: | Pattern | type | matches | |--------------------|------------------|---------| -| `pattern:x(?=y)` | Positive lookahead | `x` if followed by `y` | -| `pattern:x(?!y)` | Negative lookahead | `x` if not followed by `y` | -| `pattern:(?<=y)x` | Positive lookbehind | `x` if after `y` | -| `pattern:(?. - -Let's briefly review them here. In short, normally characters are encoded with 2 bytes. That gives us 65536 characters maximum. But there are more characters in the world. - -So certain rare characters are encoded with 4 bytes, like `𝒳` (mathematical X) or `😄` (a smile). - -Here are the unicode values to compare: - -| Character | Unicode | Bytes | -|------------|---------|--------| -| `a` | 0x0061 | 2 | -| `≈` | 0x2248 | 2 | -|`𝒳`| 0x1d4b3 | 4 | -|`𝒴`| 0x1d4b4 | 4 | -|`😄`| 0x1f604 | 4 | - -So characters like `a` and `≈` occupy 2 bytes, and those rare ones take 4. - -The unicode is made in such a way that the 4-byte characters only have a meaning as a whole. - -In the past JavaScript did not know about that, and many string methods still have problems. For instance, `length` thinks that here are two characters: - -```js run -alert('😄'.length); // 2 -alert('𝒳'.length); // 2 -``` - -...But we can see that there's only one, right? The point is that `length` treats 4 bytes as two 2-byte characters. That's incorrect, because they must be considered only together (so-called "surrogate pair"). - -Normally, regular expressions also treat "long characters" as two 2-byte ones. 
- -That leads to odd results, for instance let's try to find `pattern:[𝒳𝒴]` in the string `subject:𝒳`: - -```js run -alert( '𝒳'.match(/[𝒳𝒴]/) ); // odd result (wrong match actually, "half-character") -``` - -The result is wrong, because by default the regexp engine does not understand surrogate pairs. - -So, it thinks that `[𝒳𝒴]` are not two, but four characters: -1. the left half of `𝒳` `(1)`, -2. the right half of `𝒳` `(2)`, -3. the left half of `𝒴` `(3)`, -4. the right half of `𝒴` `(4)`. - -We can list them like this: - -```js run -for(let i=0; i<'𝒳𝒴'.length; i++) { - alert('𝒳𝒴'.charCodeAt(i)); // 55349, 56499, 55349, 56500 -}; -``` - -So it finds only the "left half" of `𝒳`. - -In other words, the search works like `'12'.match(/[1234]/)`: only `1` is returned. - -## The "u" flag - -The `/.../u` flag fixes that. - -It enables surrogate pairs in the regexp engine, so the result is correct: - -```js run -alert( '𝒳'.match(/[𝒳𝒴]/u) ); // 𝒳 -``` - -Let's see one more example. - -If we forget the `u` flag and accidentally use surrogate pairs, then we can get an error: - -```js run -'𝒳'.match(/[𝒳-𝒴]/); // SyntaxError: invalid range in character class -``` - -Normally, regexps understand `[a-z]` as a "range of characters with codes between codes of `a` and `z`. - -But without `u` flag, surrogate pairs are assumed to be a "pair of independent characters", so `[𝒳-𝒴]` is like `[<55349><56499>-<55349><56500>]` (replaced each surrogate pair with code points). Now we can clearly see that the range `56499-55349` is unacceptable, as the left range border must be less than the right one. 
- -Using the `u` flag makes it work right: - -```js run -alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴 -``` diff --git a/9-regular-expressions/21-regexp-unicode-properties/article.md b/9-regular-expressions/21-regexp-unicode-properties/article.md deleted file mode 100644 index 2bb031d7..00000000 --- a/9-regular-expressions/21-regexp-unicode-properties/article.md +++ /dev/null @@ -1,86 +0,0 @@ - -# Unicode character properties \p - -[Unicode](https://en.wikipedia.org/wiki/Unicode), the encoding format used by JavaScript strings, has a lot of properties for different characters (or, technically, code points). They describe which "categories" character belongs to, and a variety of technical details. - -In regular expressions these can be set by `\p{…}`. And there must be flag `'u'`. - -For instance, `\p{Letter}` denotes a letter in any of language. We can also use `\p{L}`, as `L` is an alias of `Letter`, there are shorter aliases for almost every property. - -Here's the main tree of properties: - -- Letter `L`: - - lowercase `Ll`, modifier `Lm`, titlecase `Lt`, uppercase `Lu`, other `Lo` -- Number `N`: - - decimal digit `Nd`, letter number `Nl`, other `No` -- Punctuation `P`: - - connector `Pc`, dash `Pd`, initial quote `Pi`, final quote `Pf`, open `Ps`, close `Pe`, other `Po` -- Mark `M` (accents etc): - - spacing combining `Mc`, enclosing `Me`, non-spacing `Mn` -- Symbol `S`: - - currency `Sc`, modifier `Sk`, math `Sm`, other `So` -- Separator `Z`: - - line `Zl`, paragraph `Zp`, space `Zs` -- Other `C`: - - control `Cc`, format `Cf`, not assigned `Cn`, private use `Co`, surrogate `Cs` - -```smart header="More information" -Interested to see which characters belong to a property? There's a tool at for that. - -You could also explore properties at [Character Property Index](http://unicode.org/cldr/utility/properties.jsp). - -For the full Unicode Character Database in text format (along with all properties), see . 
-``` - -There are also other derived categories, like: -- `Alphabetic` (`Alpha`), includes Letters `L`, plus letter numbers `Nl` (e.g. roman numbers Ⅻ), plus some other symbols `Other_Alphabetic` (`OAltpa`). -- `Hex_Digit` includes hexadecimal digits: `0-9`, `a-f`. -- ...Unicode is a big beast, it includes a lot of properties. - -For instance, let's look for a 6-digit hex number: - -```js run -let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is required - -alert("color: #123ABC".match(reg)); // 123ABC -``` - -There are also properties with a value. For instance, Unicode "Script" (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese) etc, the [list is long]("https://en.wikipedia.org/wiki/Script_(Unicode)"). - -To search for characters in certain scripts ("alphabets"), we should supply `Script=`, e.g. to search for cyrillic letters: `\p{sc=Cyrillic}`, for Chinese glyphs: `\p{sc=Han}`, etc: - -```js run -let regexp = /\p{sc=Han}+/gu; // get chinese words - -let str = `Hello Привет 你好 123_456`; - -alert( str.match(regexp) ); // 你好 -``` - -## Building multi-language \w - -The pattern `pattern:\w` means "wordly characters", but doesn't work for languages that use non-Latin alphabets, such as Cyrillic and others. It's just a shorthand for `[a-zA-Z0-9_]`, so `pattern:\w+` won't find any Chinese words etc. - -Let's make a "universal" regexp, that looks for wordly characters in any language. That's easy to do using Unicode properties: - -```js -/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u -``` - -Let's decipher. Just as `pattern:\w` is the same as `pattern:[a-zA-Z0-9_]`, we're making a set of our own, that includes: - -- `Alphabetic` for letters, -- `Mark` for accents, as in Unicode accents may be represented by separate code points, -- `Decimal_Number` for numbers, -- `Connector_Punctuation` for the `'_'` character and alike, -- `Join_Control` -– two special code points with hex codes `200c` and `200d`, used in ligatures e.g. 
in arabic. - -Or, if we replace long names with aliases (a list of aliases [here](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)): - -```js run -let regexp = /([\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+)/gu; - -let str = `Hello Привет 你好 123_456`; - -alert( str.match(regexp) ); // Hello,Привет,你好,123_456 -``` diff --git a/9-regular-expressions/index.md b/9-regular-expressions/index.md index 7499c584..ac25aaa6 100644 --- a/9-regular-expressions/index.md +++ b/9-regular-expressions/index.md @@ -1,7 +1,3 @@ # Regular expressions Regular expressions is a powerful way of doing search and replace in strings. - -In JavaScript regular expressions are implemented using objects of a built-in `RegExp` class and integrated with strings. - -Please note that regular expressions vary between programming languages. In this tutorial we concentrate on JavaScript. Of course there's a lot in common, but they are a somewhat different in Perl, Ruby, PHP etc.