diff --git a/9-regular-expressions/11-regexp-groups/01-test-mac/solution.md b/9-regular-expressions/11-regexp-groups/01-test-mac/solution.md new file mode 100644 index 00000000..c16f0565 --- /dev/null +++ b/9-regular-expressions/11-regexp-groups/01-test-mac/solution.md @@ -0,0 +1,21 @@ +A two-digit hex number is `pattern:[0-9a-f]{2}` (assuming the flag `pattern:i` is set). + +We need that number `NN`, and then `:NN` repeated 5 times (more numbers); + +The regexp is: `pattern:[0-9a-f]{2}(:[0-9a-f]{2}){5}` + +Now let's show that the match should capture all the text: start at the beginning and end at the end. That's done by wrapping the pattern in `pattern:^...$`. + +Finally: + +```js run +let reg = /^[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}$/i; + +alert( reg.test('01:32:54:67:89:AB') ); // true + +alert( reg.test('0132546789AB') ); // false (no colons) + +alert( reg.test('01:32:54:67:89') ); // false (5 numbers, need 6) + +alert( reg.test('01:32:54:67:89:ZZ') ) // false (ZZ in the end) +``` diff --git a/9-regular-expressions/11-regexp-groups/01-test-mac/task.md b/9-regular-expressions/11-regexp-groups/01-test-mac/task.md new file mode 100644 index 00000000..e7265598 --- /dev/null +++ b/9-regular-expressions/11-regexp-groups/01-test-mac/task.md @@ -0,0 +1,20 @@ +# Check MAC-address + +[MAC-address](https://en.wikipedia.org/wiki/MAC_address) of a network interface consists of 6 two-digit hex numbers separated by a colon. + +For instance: `subject:'01:32:54:67:89:AB'`. + +Write a regexp that checks whether a string is MAC-address. 
+ +Usage: +```js +let reg = /your regexp/; + +alert( reg.test('01:32:54:67:89:AB') ); // true + +alert( reg.test('0132546789AB') ); // false (no colons) + +alert( reg.test('01:32:54:67:89') ); // false (5 numbers, must be 6) + +alert( reg.test('01:32:54:67:89:ZZ') ); // false (ZZ at the end) +``` diff --git a/9-regular-expressions/11-regexp-groups/1-find-webcolor-3-or-6/solution.md b/9-regular-expressions/11-regexp-groups/02-find-webcolor-3-or-6/solution.md similarity index 100% rename from 9-regular-expressions/11-regexp-groups/1-find-webcolor-3-or-6/solution.md rename to 9-regular-expressions/11-regexp-groups/02-find-webcolor-3-or-6/solution.md diff --git a/9-regular-expressions/11-regexp-groups/1-find-webcolor-3-or-6/task.md b/9-regular-expressions/11-regexp-groups/02-find-webcolor-3-or-6/task.md similarity index 100% rename from 9-regular-expressions/11-regexp-groups/1-find-webcolor-3-or-6/task.md rename to 9-regular-expressions/11-regexp-groups/02-find-webcolor-3-or-6/task.md diff --git a/9-regular-expressions/11-regexp-groups/2-find-decimal-numbers/solution.md b/9-regular-expressions/11-regexp-groups/03-find-decimal-numbers/solution.md similarity index 100% rename from 9-regular-expressions/11-regexp-groups/2-find-decimal-numbers/solution.md rename to 9-regular-expressions/11-regexp-groups/03-find-decimal-numbers/solution.md diff --git a/9-regular-expressions/11-regexp-groups/2-find-decimal-numbers/task.md b/9-regular-expressions/11-regexp-groups/03-find-decimal-numbers/task.md similarity index 100% rename from 9-regular-expressions/11-regexp-groups/2-find-decimal-numbers/task.md rename to 9-regular-expressions/11-regexp-groups/03-find-decimal-numbers/task.md diff --git a/9-regular-expressions/11-regexp-groups/5-parse-expression/solution.md b/9-regular-expressions/11-regexp-groups/04-parse-expression/solution.md similarity index 100% rename from 9-regular-expressions/11-regexp-groups/5-parse-expression/solution.md rename to
9-regular-expressions/11-regexp-groups/04-parse-expression/solution.md diff --git a/9-regular-expressions/11-regexp-groups/5-parse-expression/task.md b/9-regular-expressions/11-regexp-groups/04-parse-expression/task.md similarity index 100% rename from 9-regular-expressions/11-regexp-groups/5-parse-expression/task.md rename to 9-regular-expressions/11-regexp-groups/04-parse-expression/task.md diff --git a/9-regular-expressions/11-regexp-groups/article.md b/9-regular-expressions/11-regexp-groups/article.md index 9a3bb04f..855568be 100644 --- a/9-regular-expressions/11-regexp-groups/article.md +++ b/9-regular-expressions/11-regexp-groups/article.md @@ -65,7 +65,7 @@ That regexp is not perfect, but mostly works and helps to fix accidental mistype ## Parentheses contents in the match -Parentheses are numbered from left to right. The search engine remembers the content matched by each of them and allows to get it in the result. +Parentheses are numbered from left to right. The search engine memorizes the content matched by each of them and makes it available in the result. The method `str.match(regexp)`, if `regexp` has no flag `g`, looks for the first match and returns it as an array: @@ -347,4 +347,4 @@ If the parentheses have no name, then their contents is available in the match a We can also use parentheses contents in the replacement string in `str.replace`: by the number `$n` or the name `$<name>`. -A group may be excluded from remembering by adding `pattern:?:` in its start. That's used when we need to apply a quantifier to the whole group, but don't remember it as a separate item in the results array. We also can't reference such parentheses in the replacement string. +A group may be excluded from numbering by adding `pattern:?:` at its start. That's used when we need to apply a quantifier to the whole group, but don't want it as a separate item in the results array. We also can't reference such parentheses in the replacement string.
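The effect of `pattern:?:` described in the last changed paragraph can be checked with a minimal sketch. The sample string and variable names are hypothetical, chosen only for illustration, and `console.log` stands in for the `alert` calls used in the articles:

```js
// Hypothetical sample string, chosen only for illustration.
let str = "Gogogo John!";

// Capturing group: the last repetition of (go) is stored as item [1].
let capturing = str.match(/(go)+/i);

// Non-capturing group: ?: keeps the quantifier, but adds no item to the result.
let nonCapturing = str.match(/(?:go)+/i);

// Excluded groups are skipped in numbering, so $1 here refers to (\w+), the name.
let replaced = str.replace(/(?:go)+ (\w+)/i, "$1");

console.log(capturing.length, capturing[1]); // 2 go
console.log(nonCapturing.length);            // 1
console.log(replaced);                       // John!
```

With the capturing variant the result array has two items, with `pattern:?:` only the full match remains, and in the replacement `$1` points at the first group that is still numbered.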
diff --git a/9-regular-expressions/12-regexp-backreferences/article.md b/9-regular-expressions/12-regexp-backreferences/article.md index eff5cab4..07d2ca07 100644 --- a/9-regular-expressions/12-regexp-backreferences/article.md +++ b/9-regular-expressions/12-regexp-backreferences/article.md @@ -1,31 +1,31 @@ -# Backreferences in pattern: \n and \k +# Backreferences in pattern: \N and \k -We can use the contents of capturing groups `(...)` not only in the result or in the replacement string, but also in the pattern itself. +We can use the contents of capturing groups `pattern:(...)` not only in the result or in the replacement string, but also in the pattern itself. -## Backreference by number: \n +## Backreference by number: \N -A group can be referenced in the pattern using `\n`, where `n` is the group number. +A group can be referenced in the pattern using `pattern:\N`, where `N` is the group number. -To make things clear let's consider a task. +To make clear why that's helpful, let's consider a task. -We need to find a quoted string: either a single-quoted `subject:'...'` or a double-quoted `subject:"..."` -- both variants need to match. +We need to find quoted strings: either single-quoted `subject:'...'` or a double-quoted `subject:"..."` -- both variants should match. -How to look for them? +How to find them? -We can put both kinds of quotes in the square brackets: `pattern:['"](.*?)['"]`, but it would find strings with mixed quotes, like `match:"...'` and `match:'..."`. That would lead to incorrect matches when one quote appears inside other ones, like the string `subject:"She's the one!"`: +We can put both kinds of quotes in the square brackets: `pattern:['"](.*?)['"]`, but it would find strings with mixed quotes, like `match:"...'` and `match:'..."`. 
That would lead to incorrect matches when one quote appears inside other ones, like in the string `subject:"She's the one!"`: ```js run let str = `He said: "She's the one!".`; let reg = /['"](.*?)['"]/g; -// The result is not what we expect +// The result is not what we'd like to have alert( str.match(reg) ); // "She' ``` -As we can see, the pattern found an opening quote `match:"`, then the text is consumed lazily till the other quote `match:'`, that closes the match. +As we can see, the pattern found an opening quote `match:"`, then the text is consumed till the other quote `match:'`, that closes the match. -To make sure that the pattern looks for the closing quote exactly the same as the opening one, we can wrap it into a capturing group and use the backreference. +To make sure that the pattern looks for the closing quote exactly the same as the opening one, we can wrap it into a capturing group and backreference it: `pattern:(['"])(.*?)\1`. Here's the correct code: @@ -39,20 +39,27 @@ let reg = /(['"])(.*?)\1/g; alert( str.match(reg) ); // "She's the one!" ``` -Now it works! The regular expression engine finds the first quote `pattern:(['"])` and remembers the content of `pattern:(...)`, that's the first capturing group. +Now it works! The regular expression engine finds the first quote `pattern:(['"])` and memorizes its content. That's the first capturing group. Further in the pattern `pattern:\1` means "find the same text as in the first group", exactly the same quote in our case. -Please note: +Similar to that, `pattern:\2` would mean the contents of the second group, `pattern:\3` - the 3rd group, and so on. -- To reference a group inside a replacement string -- we use `$1`, while in the pattern -- a backslash `\1`. -- If we use `?:` in the group, then we can't reference it. Groups that are excluded from capturing `(?:...)` are not remembered by the engine. +```smart +If we use `?:` in the group, then we can't reference it. 
Groups that are excluded from capturing `(?:...)` are not memorized by the engine. +``` + +```warn header="Don't mess up: in the pattern `pattern:\1`, in the replacement: `pattern:$1`" +In the replacement string we use a dollar sign: `pattern:$1`, while in the pattern - a backslash `pattern:\1`. +``` ## Backreference by name: `\k<name>` -For named groups, we can backreference by `\k<name>`. +If a regexp has many parentheses, it's convenient to give them names. -The same example with the named group: +To reference a named group we can use `pattern:\k<name>`. + +In the example below the group with quotes is named `pattern:?`, so the backreference is `pattern:\k`: ```js run let str = `He said: "She's the one!".`; diff --git a/9-regular-expressions/13-regexp-alternation/article.md b/9-regular-expressions/13-regexp-alternation/article.md index b26f7e4a..5dcb9e86 100644 --- a/9-regular-expressions/13-regexp-alternation/article.md +++ b/9-regular-expressions/13-regexp-alternation/article.md @@ -18,7 +18,7 @@ let str = "First HTML appeared, then CSS, then JavaScript"; alert( str.match(reg) ); // 'HTML', 'CSS', 'JavaScript' ``` -We already know a similar thing -- square brackets. They allow to choose between multiple character, for instance `pattern:gr[ae]y` matches `match:gray` or `match:grey`. +We already saw a similar thing -- square brackets. They allow to choose between multiple characters, for instance `pattern:gr[ae]y` matches `match:gray` or `match:grey`. Square brackets allow only characters or character sets. Alternation allows any expressions. A regexp `pattern:A|B|C` means one of expressions `A`, `B` or `C`. @@ -27,30 +27,41 @@ For instance: - `pattern:gr(a|e)y` means exactly the same as `pattern:gr[ae]y`. - `pattern:gra|ey` means `match:gra` or `match:ey`. -To separate a part of the pattern for alternation we usually enclose it in parentheses, like this: `pattern:before(XXX|YYY)after`.
+To apply alternation to a chosen part of the pattern, we can enclose it in parentheses: +- `pattern:I love HTML|CSS` matches `match:I love HTML` or `match:CSS`. +- `pattern:I love (HTML|CSS)` matches `match:I love HTML` or `match:I love CSS`. -## Regexp for time +## Example: regexp for time -In previous chapters there was a task to build a regexp for searching time in the form `hh:mm`, for instance `12:00`. But a simple `pattern:\d\d:\d\d` is too vague. It accepts `25:99` as the time (as 99 seconds match the pattern). +In previous articles there was a task to build a regexp for searching time in the form `hh:mm`, for instance `12:00`. But a simple `pattern:\d\d:\d\d` is too vague. It accepts `25:99` as the time (25 hours and 99 minutes match the pattern, but that time is invalid). -How can we make a better one? +How can we make a better pattern? -We can apply more careful matching. First, the hours: +We can use more careful matching. First, the hours: -- If the first digit is `0` or `1`, then the next digit can by anything. -- Or, if the first digit is `2`, then the next must be `pattern:[0-3]`. +- If the first digit is `0` or `1`, then the next digit can be any digit: `pattern:[01]\d`. +- Otherwise, if the first digit is `2`, then the next must be `pattern:[0-3]`. +- (no other first digit is allowed) -As a regexp: `pattern:[01]\d|2[0-3]`. +We can write both variants in a regexp using alternation: `pattern:[01]\d|2[0-3]`. -Next, the minutes must be from `0` to `59`. In the regexp language that means `pattern:[0-5]\d`: the first digit `0-5`, and then any digit. +Next, minutes must be from `00` to `59`. In the regular expression language that can be written as `pattern:[0-5]\d`: the first digit `0-5`, and then any digit. -Let's glue them together into the pattern: `pattern:[01]\d|2[0-3]:[0-5]\d`. +If we glue hours and minutes together, we get the pattern: `pattern:[01]\d|2[0-3]:[0-5]\d`. We're almost done, but there's a problem.
The alternation `pattern:|` now happens to be between `pattern:[01]\d` and `pattern:2[0-3]:[0-5]\d`. -That's wrong, as it should be applied only to hours `[01]\d` OR `2[0-3]`. That's a common mistake when starting to work with regular expressions. +That is: minutes are added to the second alternation variant, here's a clear picture: -The correct variant: +``` +[01]\d | 2[0-3]:[0-5]\d +``` + +That pattern looks for `pattern:[01]\d` or `pattern:2[0-3]:[0-5]\d`. + +But that's wrong, the alternation should only be used in the "hours" part of the regular expression, to allow `pattern:[01]\d` OR `pattern:2[0-3]`. Let's correct that by enclosing "hours" into parentheses: `pattern:([01]\d|2[0-3]):[0-5]\d`. + +The final solution: ```js run let reg = /([01]\d|2[0-3]):[0-5]\d/g; diff --git a/9-regular-expressions/14-regexp-lookahead-lookbehind/2-insert-after-head/solution.md b/9-regular-expressions/14-regexp-lookahead-lookbehind/2-insert-after-head/solution.md new file mode 100644 index 00000000..980a7fe6 --- /dev/null +++ b/9-regular-expressions/14-regexp-lookahead-lookbehind/2-insert-after-head/solution.md @@ -0,0 +1,29 @@ + +To insert after the `` tag, we first need to find it. We'll use the regular expression `pattern:`. + +Next, we need to keep the `` tag itself in place and add the text after it. + +That can be done like this: +```js run +let str = '......'; +str = str.replace(//, '$&Hello'); + +alert(str); // ...Hello... +``` + +In the replacement string `$&` means the match itself, that is, `pattern:` is replaced by itself plus `Hello`. + +An alternative is to use lookbehind: + +```js run +let str = '......'; +str = str.replace(/(?<=)/, `Hello`); + +alert(str); // ...Hello... +``` + +At every position, such a regular expression checks whether `pattern:` comes right before it. If so, a match is found. But the tag `pattern:` itself is not part of the match, it only takes part in the check. And there are no other characters in the pattern after the check, so the matched text is empty. + +An "empty string" preceded by `pattern:` is replaced with `Hello`. That is exactly the insertion of this string after ``. + +P.S. The flags `pattern://si` won't hurt this regular expression: with them the "dot" also matches a newline (the tag may span several lines), and tags in a different case, like `match:`, are found too. diff --git a/9-regular-expressions/14-regexp-lookahead-lookbehind/2-insert-after-head/task.md b/9-regular-expressions/14-regexp-lookahead-lookbehind/2-insert-after-head/task.md new file mode 100644 index 00000000..7bdfcd67 --- /dev/null +++ b/9-regular-expressions/14-regexp-lookahead-lookbehind/2-insert-after-head/task.md @@ -0,0 +1,30 @@ +# Insert after a fragment + +There is a string with an HTML document. + +Insert the string `Hello` after the `` tag (it may have attributes). + +For example: + +```js +let reg = /your regular expression/; + +let str = ` + + + ... + + +`; + +str = str.replace(reg, `Hello`); +``` + +After that, the value of `str`: +```html + +Hello + ... + + +``` diff --git a/9-regular-expressions/14-regexp-lookahead-lookbehind/article.md b/9-regular-expressions/14-regexp-lookahead-lookbehind/article.md index 8e36fb0b..1115c502 100644 --- a/9-regular-expressions/14-regexp-lookahead-lookbehind/article.md +++ b/9-regular-expressions/14-regexp-lookahead-lookbehind/article.md @@ -1,54 +1,82 @@ # Lookahead and lookbehind -Sometimes we need to match a pattern only if followed by another pattern. For instance, we'd like to get the price from a string like `subject:1 turkey costs 30€`. +Sometimes we need to find only those matches for a pattern that are followed or preceded by another pattern. -We need a number (let's say a price has no decimal point) followed by `subject:€` sign. +There's a special syntax for that, called "lookahead" and "lookbehind", together referred to as "lookaround". -That's what lookahead is for. +To start, let's find the price in a string like `subject:1 turkey costs 30€`. That is: a number, followed by the `subject:€` sign. ## Lookahead -The syntax is: `pattern:x(?=y)`, it means "look for `pattern:x`, but match only if followed by `pattern:y`". +The syntax is: `pattern:X(?=Y)`, it means "look for `pattern:X`, but match only if followed by `pattern:Y`". There may be any pattern instead of `pattern:X` and `pattern:Y`. -For an integer amount followed by `subject:€`, the regexp will be `pattern:\d+(?=€)`: +For an integer number followed by `subject:€`, the regexp will be `pattern:\d+(?=€)`: ```js run let str = "1 turkey costs 30€"; -alert( str.match(/\d+(?=€)/) ); // 30 (correctly skipped the sole number 1) +alert( str.match(/\d+(?=€)/) ); // 30, the number 1 is ignored, as it's not followed by € ``` -Let's say we want a quantity instead, that is a number, NOT followed by `subject:€`. +Please note: the lookahead is merely a test, the contents of the parentheses `pattern:(?=...)` is not included in the result `match:30`. -Here a negative lookahead can be applied.
+When we look for `pattern:X(?=Y)`, the regular expression engine finds `pattern:X` and then checks if there's `pattern:Y` immediately after it. If it's not so, then the potential match is skipped, and the search continues. -The syntax is: `pattern:x(?!y)`, it means "search `pattern:x`, but only if not followed by `pattern:y`". +More complex tests are possible, e.g. `pattern:X(?=Y)(?=Z)` means: + +1. Find `pattern:X`. +2. Check if `pattern:Y` is immediately after `pattern:X` (skip if isn't). +3. Check if `pattern:Z` is immediately after `pattern:X` (skip if isn't). +4. If both tests passed, then it's the match. + +In other words, such pattern means that we're looking for `pattern:X` followed by `pattern:Y` and `pattern:Z` at the same time. + +That's only possible if patterns `pattern:Y` and `pattern:Z` aren't mutually exclusive. + +For example, `pattern:\d+(?=\s)(?=.*30)` looks for `pattern:\d+` only if it's followed by a space, and there's `30` somewhere after it: + +```js run +let str = "1 turkey costs 30€"; + +alert( str.match(/\d+(?=\s)(?=.*30)/) ); // 1 +``` + +In our string that exactly matches the number `1`. + +## Negative lookahead + +Let's say that we want a quantity instead, not a price from the same string. That's a number `pattern:\d+`, NOT followed by `subject:€`. + +For that, a negative lookahead can be applied. + +The syntax is: `pattern:X(?!Y)`, it means "search `pattern:X`, but only if not followed by `pattern:Y`". ```js run let str = "2 turkeys cost 60€"; -alert( str.match(/\d+(?!€)/) ); // 2 (correctly skipped the price) +alert( str.match(/\d+(?!€)/) ); // 2 (the price is skipped) ``` ## Lookbehind -Lookahead allows to add a condition for "what goes after". +Lookahead allows to add a condition for "what follows". -Lookbehind is similar, but it looks behind. That is, it allows to match a pattern only if there's something before. +Lookbehind is similar, but it looks behind. That is, it allows to match a pattern only if there's something before it. 
The syntax is: -- Positive lookbehind: `pattern:(?<=y)x`, matches `pattern:x`, but only if it follows after `pattern:y`. -- Negative lookbehind: `pattern:(? + +```js run +let reg = /^(\d+)*$/; + +let str = "012345678901234567890123456789!"; + +// will take a very long time +alert( reg.test(str) ); +``` + +So what's wrong with the regexp? + +First, one may notice that the regexp `pattern:(\d+)*` is a little bit strange. The quantifier `pattern:*` looks extraneous. If we want a number, we can use `pattern:\d+`. + +Indeed, the regexp is artificial. But the reason why it is slow is the same as those we saw above. So let's understand it, and then the previous example will become obvious. + +What happens during the search of `pattern:^(\d+)*$` in the line `subject:123456789!` (shortened a bit for clarity), why does it take so long? + +1. First, the regexp engine tries to find a number `pattern:\d+`. The plus `pattern:+` is greedy by default, so it consumes all digits: + + ``` + \d+....... + (123456789)! + ``` + + Then it tries to apply the star quantifier, but there are no more digits, so the star doesn't give anything. + + The next in the pattern is the string end `pattern:$`, but in the text we have `subject:!`, so there's no match: + + ``` + X + \d+........$ + (123456789)! + ``` + +2. As there's no match, the greedy quantifier `pattern:+` decreases the count of repetitions, backtracks one character back. + + Now `pattern:\d+` takes all digits except the last one: + ``` + \d+....... + (12345678)9! + ``` +3. Then the engine tries to continue the search from the new position (`9`). + + The star `pattern:(\d+)*` can be applied -- it gives the number `match:9`: + + ``` + + \d+.......\d+ + (12345678)(9)! + ``` + + The engine tries to match `pattern:$` again, but fails, because it meets `subject:!`: + + ``` + X + \d+.......\d+ + (12345678)(9)! + ``` + + +4. There's no match, so the engine will continue backtracking, decreasing the number of repetitions.
Backtracking generally works like this: the last greedy quantifier decreases the number of repetitions while it can. Then the previous greedy quantifier decreases, and so on. + + All possible combinations are attempted. Here are some examples. + + The first number `pattern:\d+` has 7 digits, and then a number of 2 digits: + + ``` + X + \d+......\d+ + (1234567)(89)! + ``` + + The first number has 7 digits, and then two numbers of 1 digit each: + + ``` + X + \d+......\d+\d+ + (1234567)(8)(9)! + ``` + + The first number has 6 digits, and then a number of 3 digits: + + ``` + X + \d+.......\d+ + (123456)(789)! + ``` + + The first number has 6 digits, and then 2 numbers: + + ``` + X + \d+.....\d+ \d+ + (123456)(78)(9)! + ``` + + ...And so on. + + +There are many ways to split a set of digits `123456789` into numbers. To be precise, there are 2<sup>n</sup>-1, where `n` is the length of the set. + +For `n=20` there are about 1 million combinations, for `n=30` - a thousand times more. Trying each of them is exactly the reason why the search takes so long. + +What to do? + +Should we turn on the lazy mode? + +Unfortunately, that won't help: if we replace `pattern:\d+` with `pattern:\d+?`, the regexp will still hang. The order of combinations will change, but not their total count. + +Some regular expression engines have tricky tests and finite automata that allow them to avoid going through all combinations or make it much faster, but not all engines, and not in all cases. + +## Back to words and strings + +A similar thing happens in our first example, when we look for words with the pattern `pattern:^(\w+\s?)*$` in the string `subject:An input that hangs!`. + +The reason is that a word can be represented as one `pattern:\w+` or many: + +``` +(input) +(inpu)(t) +(inp)(u)(t) +(in)(p)(ut) +...
+``` + +For a human, it's obvious that there can be no match, because the string ends with an exclamation sign `!`, but the regular expression expects a word character `pattern:\w` or a space `pattern:\s` at the end. But the engine doesn't know that. + +It tries all combinations of how the regexp `pattern:(\w+\s?)*` can "consume" the string, including variants with spaces `pattern:(\w+\s)*` and without them `pattern:(\w+)*` (because spaces `pattern:\s?` are optional). As there are many such combinations, the search takes a lot of time. + +## How to fix? + +There are two main approaches to fixing the problem. + +The first is to lower the number of possible combinations. + +Let's rewrite the regular expression as `pattern:^(\w+\s)*\w*$` - we'll look for any number of words followed by a space `pattern:(\w+\s)*`, and then (optionally) a word `pattern:\w*`. + +This regexp is equivalent to the previous one (matches the same) and works well: + +```js run +let reg = /^(\w+\s)*\w*$/; +let str = "An input string that takes a long time or even makes this regex to hang!"; + +alert( reg.test(str) ); // false +``` + +Why did the problem disappear? + +Now the star `pattern:*` goes after `pattern:\w+\s` instead of `pattern:\w+\s?`. It became impossible to represent one word of the string with multiple successive `pattern:\w+`. The time needed to try such combinations is now saved. + +For example, the previous pattern `pattern:(\w+\s?)*` could match the word `subject:string` as two `pattern:\w+`: + +``` +\w+\w+ +string +``` + +The previous pattern, due to the optional `pattern:\s`, allowed variants `pattern:\w+`, `pattern:\w+\s`, `pattern:\w+\w+` and so on. + +With the rewritten pattern `pattern:(\w+\s)*`, that's impossible: there may be `pattern:\w+\s` or `pattern:\w+\s\w+\s`, but not `pattern:\w+\w+`. So the overall combinations count is greatly decreased. + +## Preventing backtracking + +It's not always convenient to rewrite a regexp.
And it's not always obvious how to do it. + +The alternative approach is to forbid backtracking for the quantifier. + +The regular expressions engine tries many combinations that are obviously wrong for a human. + +E.g. in the regexp `pattern:(\d+)*$` it's obvious for a human, that `pattern:+` shouldn't backtrack. If we replace one `pattern:\d+` with two separate `pattern:\d+\d+`, nothing changes: + +``` +\d+........ +(123456789)! + +\d+...\d+.... +(1234)(56789)! +``` + +And in the original example `pattern:^(\w+\s?)*$` we may want to forbid backtracking in `pattern:\w+`. That is: `pattern:\w+` should match a whole word, with the maximal possible length. There's no need to lower the repetitions count in `pattern:\w+`, try to split it into two words `pattern:\w+\w+` and so on. + +Modern regular expression engines support possessive quantifiers for that. They are like greedy ones, but don't backtrack (so they are actually simpler than regular quantifiers). + +There are also so-called "atomic capturing groups" - a way to disable backtracking inside parentheses. + +Unfortunately, in JavaScript they are not supported. But there's another way. + +### Lookahead to the rescue! + +We can prevent backtracking using lookahead. + +The pattern to take as much repetitions of `pattern:\w` as possible without backtracking is: `pattern:(?=(\w+))\1`. + +Let's decipher it: +- Lookahead `pattern:?=` looks forward for the longest word `pattern:\w+` starting at the current position. +- The contents of parentheses with `pattern:?=...` isn't memorized by the engine, so wrap `pattern:\w+` into parentheses. Then the engine will memorize their contents +- ...And allow us to reference it in the pattern as `pattern:\1`. + +That is: we look ahead - and if there's a word `pattern:\w+`, then match it as `pattern:\1`. + +Why? That's because the lookahead finds a word `pattern:\w+` as a whole and we capture it into the pattern with `pattern:\1`. 
So we essentially implemented a possessive plus `pattern:+` quantifier. It captures only the whole word `pattern:\w+`, not a part of it. + +For instance, in the word `subject:JavaScript` it can't match only `match:Java` and leave `match:Script` for the rest of the pattern. + +Here's the comparison of two patterns: + +```js run +alert( "JavaScript".match(/\w+Script/)); // JavaScript +alert( "JavaScript".match(/(?=(\w+))\1Script/)); // null +``` + +1. In the first variant `pattern:\w+` first captures the whole word `subject:JavaScript` but then `pattern:+` backtracks character by character, to try to match the rest of the pattern, until it finally succeeds (when `pattern:\w+` matches `match:Java`). +2. In the second variant `pattern:(?=(\w+))` looks ahead and finds the word `subject:JavaScript`, that is included into the pattern as a whole by `pattern:\1`, so there remains no way to find `subject:Script` after it. + +We can put a more complex regular expression into `pattern:(?=(\w+))\1` instead of `pattern:\w`, when we need to forbid backtracking for `pattern:+` after it. + +```smart +There's more about the relation between possessive quantifiers and lookahead in articles [Regex: Emulate Atomic Grouping (and Possessive Quantifiers) with LookAhead](http://instanceof.me/post/52245507631/regex-emulate-atomic-grouping-with-lookahead) and [Mimicking Atomic Groups](http://blog.stevenlevithan.com/archives/mimic-atomic-groups). +``` + +Let's rewrite the first example using lookahead to prevent backtracking: + +```js run +let reg = /^((?=(\w+))\2\s?)*$/; + +alert( reg.test("A good string") ); // true + +let str = "An input string that takes a long time or even makes this regex to hang!"; + +alert( reg.test(str) ); // false, and it works fast! +``` + +Here `pattern:\2` is used instead of `pattern:\1`, because there are additional outer parentheses. To avoid mixing up the numbers, we can give the parentheses a name, e.g. `pattern:(?\w+)`.
+ +```js run +// parentheses are named ?, referenced as \k +let reg = /^((?=(?\w+))\k\s?)*$/; + +let str = "An input string that takes a long time or even makes this regex to hang!"; + +alert( reg.test(str) ); // false + +alert( reg.test("A correct string") ); // true +``` + +The problem described in this article is called "catastrophic backtracking". + +We covered two ways how to solve it: +- Rewrite the regexp to lower the possible combinations count. +- Prevent backtracking. diff --git a/9-regular-expressions/15-regexp-infinite-backtracking-problem/article.md b/9-regular-expressions/15-regexp-infinite-backtracking-problem/article.md deleted file mode 100644 index 67f3e93c..00000000 --- a/9-regular-expressions/15-regexp-infinite-backtracking-problem/article.md +++ /dev/null @@ -1,297 +0,0 @@ -# Infinite backtracking problem - -Some regular expressions are looking simple, but can execute veeeeeery long time, and even "hang" the JavaScript engine. - -Sooner or later most developers occasionally face such behavior. - -The typical situation -- a regular expression works fine sometimes, but for certain strings it "hangs" consuming 100% of CPU. - -In a web-browser it kills the page. Not a good thing for sure. - -For server-side JavaScript it may become a vulnerability, and it uses regular expressions to process user data. Bad input will make the process hang, causing denial of service. The author personally saw and reported such vulnerabilities even for very well-known and widely used programs. - -So the problem is definitely worth to deal with. - -## Introduction - -The plan will be like this: - -1. First we see the problem how it may occur. -2. Then we simplify the situation and see why it occurs. -3. Then we fix it. - -For instance let's consider searching tags in HTML. - -We want to find all tags, with or without attributes -- like `subject:`. We need the regexp to work reliably, because HTML comes from the internet and can be messy. 
- -In particular, we need it to match tags like `` -- with `<` and `>` in attributes. That's allowed by [HTML standard](https://html.spec.whatwg.org/multipage/syntax.html#syntax-attributes). - -A simple regexp like `pattern:<[^>]+>` doesn't work, because it stops at the first `>`, and we need to ignore `<>` if inside an attribute: - -```js run -// the match doesn't reach the end of the tag - wrong! -alert( ''.match(/<[^>]+>/) ); // `. - -That regexp is not perfect! It doesn't support all the details of HTML syntax, such as unquoted values, and there are other ways to improve, but let's not add complexity. It will demonstrate the problem for us. - -The regexp seems to work: - -```js run -let reg = /<\w+(\s*\w+="[^"]*"\s*)*>/g; - -let str='...... ...'; - -alert( str.match(reg) ); // , -``` - -Great! It found both the long tag `match:` and the short one `match:`. - -Now, that we've got a seemingly working solution, let's get to the infinite backtracking itself. - -## Infinite backtracking - -If you run our regexp on the input below, it may hang the browser (or another JavaScript host): - -```js run -let reg = /<\w+(\s*\w+="[^"]*"\s*)*>/g; - -let str = ``. - -Unfortunately, the regexp still hangs: - -```js run -// only search for space-delimited attributes -let reg = /<(\s*\w+=\w+\s*)*>/g; - -let str = `` in the string `subject:` at the end, so the match is impossible, but the regexp engine doesn't know about it. The search backtracks trying different combinations of `pattern:(\s*\w+=\w+\s*)`: - -``` -(a=b a=b a=b) (a=b) -(a=b a=b) (a=b a=b) -(a=b) (a=b a=b a=b) -... -``` - -As there are many combinations, it takes a lot of time. - -## How to fix? - -The backtracking checks many variants that are an obvious fail for a human. - -For instance, in the pattern `pattern:(\d+)*$` a human can easily see that `pattern:(\d+)*` does not need to backtrack `pattern:+`. There's no difference between one or two `\d+`: - -``` -\d+........ -(123456789)z - -\d+...\d+.... 
-(1234)(56789)z -``` - -Let's get back to more real-life example: `pattern:<(\s*\w+=\w+\s*)*>`. We want it to find pairs `name=value` (as many as it can). - -What we would like to do is to forbid backtracking. - -There's totally no need to decrease the number of repetitions. - -In other words, if it found three `name=value` pairs and then can't find `>` after them, then there's no need to decrease the count of repetitions. There are definitely no `>` after those two (we backtracked one `name=value` pair, it's there): - -``` -(name=value) name=value -``` - -Modern regexp engines support so-called "possessive" quantifiers for that. They are like greedy, but don't backtrack at all. Pretty simple, they capture whatever they can, and the search continues. There's also another tool called "atomic groups" that forbid backtracking inside parentheses. - -Unfortunately, but both these features are not supported by JavaScript. - -### Lookahead to the rescue - -We can forbid backtracking using lookahead. - -The pattern to take as much repetitions as possible without backtracking is: `pattern:(?=(a+))\1`. - -In other words: -- The lookahead `pattern:?=` looks for the maximal count `pattern:a+` from the current position. -- And then they are "consumed into the result" by the backreference `pattern:\1` (`pattern:\1` corresponds to the content of the second parentheses, that is `pattern:a+`). - -There will be no backtracking, because lookahead does not backtrack. If, for -example, it found 5 instances of `pattern:a+` and the further match failed, -it won't go back to the 4th instance. - -```smart -There's more about the relation between possessive quantifiers and lookahead in articles [Regex: Emulate Atomic Grouping (and Possessive Quantifiers) with LookAhead](http://instanceof.me/post/52245507631/regex-emulate-atomic-grouping-with-lookahead) and [Mimicking Atomic Groups](http://blog.stevenlevithan.com/archives/mimic-atomic-groups). 
-``` - -So this trick makes the problem disappear. - -Let's fix the regexp for a tag with attributes from the beginning of the chapter`pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`. We'll use lookahead to prevent backtracking of `name=value` pairs: - -```js run -// regexp to search name=value -let attrReg = /(\s*\w+=(\w+|"[^"]*")\s*)/ - -// use new RegExp to nicely insert its source into (?=(a+))\1 -let fixedReg = new RegExp(`<\\w+(?=(${attrReg.source}*))\\1>`, 'g'); - -let goodInput = '...... ...'; - -let badInput = `, -alert( badInput.match(fixedReg) ); // null (no results, fast!) -``` - -Great, it works! We found both a long tag `match:` and a small one `match:`, and (!) didn't hang the engine on the bad input. - -Please note the `attrReg.source` property. `RegExp` objects provide access to their source string in it. That's convenient when we want to insert one regexp into another.