This commit is contained in:
Ilya Kantor 2019-09-06 01:15:24 +03:00
parent 20547570ff
commit 681cae4b6a
16 changed files with 505 additions and 362 deletions


@ -0,0 +1,21 @@
A two-digit hex number is `pattern:[0-9a-f]{2}` (assuming the flag `pattern:i` is set).
We need that number `NN`, and then `:NN` repeated 5 times (more numbers).
The regexp is: `pattern:[0-9a-f]{2}(:[0-9a-f]{2}){5}`
Now let's require that the match covers the whole text: it starts at the beginning and ends at the end. That's done by wrapping the pattern in `pattern:^...$`.
Finally:
```js run
let reg = /^[0-9a-f]{2}(:[0-9a-f]{2}){5}$/i;
alert( reg.test('01:32:54:67:89:AB') ); // true
alert( reg.test('0132546789AB') ); // false (no colons)
alert( reg.test('01:32:54:67:89') ); // false (5 numbers, need 6)
alert( reg.test('01:32:54:67:89:ZZ') ); // false (ZZ at the end)
```


@ -0,0 +1,20 @@
# Check MAC-address
A [MAC address](https://en.wikipedia.org/wiki/MAC_address) of a network interface consists of 6 two-digit hex numbers separated by colons.
For instance: `subject:'01:32:54:67:89:AB'`.
Write a regexp that checks whether a string is a MAC address.
Usage:
```js
let reg = /your regexp/;
alert( reg.test('01:32:54:67:89:AB') ); // true
alert( reg.test('0132546789AB') ); // false (no colons)
alert( reg.test('01:32:54:67:89') ); // false (5 numbers, must be 6)
alert( reg.test('01:32:54:67:89:ZZ') ); // false (ZZ at the end)
```


@ -65,7 +65,7 @@ That regexp is not perfect, but mostly works and helps to fix accidental mistype
## Parentheses contents in the match ## Parentheses contents in the match
Parentheses are numbered from left to right. The search engine remembers the content matched by each of them and allows to get it in the result. Parentheses are numbered from left to right. The search engine memorizes the content matched by each of them and allows to get it in the result.
The method `str.match(regexp)`, if `regexp` has no flag `g`, looks for the first match and returns it as an array: The method `str.match(regexp)`, if `regexp` has no flag `g`, looks for the first match and returns it as an array:
@ -347,4 +347,4 @@ If the parentheses have no name, then their contents is available in the match a
We can also use parentheses contents in the replacement string in `str.replace`: by the number `$n` or the name `$<name>`. We can also use parentheses contents in the replacement string in `str.replace`: by the number `$n` or the name `$<name>`.
A group may be excluded from remembering by adding `pattern:?:` in its start. That's used when we need to apply a quantifier to the whole group, but don't remember it as a separate item in the results array. We also can't reference such parentheses in the replacement string. A group may be excluded from numbering by adding `pattern:?:` in its start. That's used when we need to apply a quantifier to the whole group, but don't want it as a separate item in the results array. We also can't reference such parentheses in the replacement string.
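A minimal sketch of the difference, using made-up sample strings (the variable names are only for illustration): with `pattern:(?:...)` the quantifier still applies to the whole group, but the group gets no number of its own.

```js
let chant = "Gogogo now!";

// A capturing group: its contents become item 1 of the result
let withGroup = chant.match(/(go)+/i);
console.log(withGroup[0]);     // Gogogo
console.log(withGroup[1]);     // go
console.log(withGroup.length); // 2

// The same group with ?: is not stored separately
let withoutGroup = chant.match(/(?:go)+/i);
console.log(withoutGroup[0]);     // Gogogo
console.log(withoutGroup.length); // 1

// The excluded group is skipped in numbering, so $1 is the second parentheses
let surname = "John Smith".replace(/(?:\w+) (\w+)/, "$1");
console.log(surname); // Smith
```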


@ -1,31 +1,31 @@
# Backreferences in pattern: \n and \k # Backreferences in pattern: \N and \k<name>
We can use the contents of capturing groups `(...)` not only in the result or in the replacement string, but also in the pattern itself. We can use the contents of capturing groups `pattern:(...)` not only in the result or in the replacement string, but also in the pattern itself.
## Backreference by number: \n ## Backreference by number: \N
A group can be referenced in the pattern using `\n`, where `n` is the group number. A group can be referenced in the pattern using `pattern:\N`, where `N` is the group number.
To make things clear let's consider a task. To make clear why that's helpful, let's consider a task.
We need to find a quoted string: either a single-quoted `subject:'...'` or a double-quoted `subject:"..."` -- both variants need to match. We need to find quoted strings: either single-quoted `subject:'...'` or a double-quoted `subject:"..."` -- both variants should match.
How to look for them? How to find them?
We can put both kinds of quotes in the square brackets: `pattern:['"](.*?)['"]`, but it would find strings with mixed quotes, like `match:"...'` and `match:'..."`. That would lead to incorrect matches when one quote appears inside other ones, like the string `subject:"She's the one!"`: We can put both kinds of quotes in the square brackets: `pattern:['"](.*?)['"]`, but it would find strings with mixed quotes, like `match:"...'` and `match:'..."`. That would lead to incorrect matches when one quote appears inside other ones, like in the string `subject:"She's the one!"`:
```js run ```js run
let str = `He said: "She's the one!".`; let str = `He said: "She's the one!".`;
let reg = /['"](.*?)['"]/g; let reg = /['"](.*?)['"]/g;
// The result is not what we expect // The result is not what we'd like to have
alert( str.match(reg) ); // "She' alert( str.match(reg) ); // "She'
``` ```
As we can see, the pattern found an opening quote `match:"`, then the text is consumed lazily till the other quote `match:'`, that closes the match. As we can see, the pattern found an opening quote `match:"`, then the text is consumed till the other quote `match:'`, that closes the match.
To make sure that the pattern looks for the closing quote exactly the same as the opening one, we can wrap it into a capturing group and use the backreference. To make sure that the pattern looks for the closing quote exactly the same as the opening one, we can wrap it into a capturing group and backreference it: `pattern:(['"])(.*?)\1`.
Here's the correct code: Here's the correct code:
@ -39,20 +39,27 @@ let reg = /(['"])(.*?)\1/g;
alert( str.match(reg) ); // "She's the one!" alert( str.match(reg) ); // "She's the one!"
``` ```
Now it works! The regular expression engine finds the first quote `pattern:(['"])` and remembers the content of `pattern:(...)`, that's the first capturing group. Now it works! The regular expression engine finds the first quote `pattern:(['"])` and memorizes its content. That's the first capturing group.
Further in the pattern `pattern:\1` means "find the same text as in the first group", exactly the same quote in our case. Further in the pattern `pattern:\1` means "find the same text as in the first group", exactly the same quote in our case.
Please note: Similar to that, `pattern:\2` would mean the contents of the second group, `pattern:\3` - the 3rd group, and so on.
- To reference a group inside a replacement string -- we use `$1`, while in the pattern -- a backslash `\1`. ```smart
- If we use `?:` in the group, then we can't reference it. Groups that are excluded from capturing `(?:...)` are not remembered by the engine. If we use `?:` in the group, then we can't reference it. Groups that are excluded from capturing `(?:...)` are not memorized by the engine.
```
```warn header="Don't mess up: in the pattern `pattern:\1`, in the replacement: `pattern:$1`"
In the replacement string we use a dollar sign: `pattern:$1`, while in the pattern - a backslash `pattern:\1`.
```
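A short sketch that puts both notations next to each other, reusing the quoted-string example (the `«...»` replacement is just an illustration): the backslash form is used inside the pattern, the dollar form inside the replacement string.

```js
let quoteStr = `He said: "She's the one!".`;

// In the pattern: \1 means "the same quote as in group 1"
let quoted = quoteStr.match(/(['"])(.*?)\1/);
console.log(quoted[2]); // She's the one!

// In the replacement string: $2 inserts the contents of group 2
let requoted = quoteStr.replace(/(['"])(.*?)\1/, "«$2»");
console.log(requoted); // He said: «She's the one!».
```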
## Backreference by name: `\k<name>` ## Backreference by name: `\k<name>`
For named groups, we can backreference by `\k<name>`. If a regexp has many parentheses, it's convenient to give them names.
The same example with the named group: To reference a named group we can use `pattern:\k<name>`.
In the example below the group with quotes is named `pattern:?<quote>`, so the backreference is `pattern:\k<quote>`:
```js run ```js run
let str = `He said: "She's the one!".`; let str = `He said: "She's the one!".`;


@ -18,7 +18,7 @@ let str = "First HTML appeared, then CSS, then JavaScript";
alert( str.match(reg) ); // 'HTML', 'CSS', 'JavaScript' alert( str.match(reg) ); // 'HTML', 'CSS', 'JavaScript'
``` ```
We already know a similar thing -- square brackets. They allow to choose between multiple character, for instance `pattern:gr[ae]y` matches `match:gray` or `match:grey`. We already saw a similar thing -- square brackets. They allow to choose between multiple characters, for instance `pattern:gr[ae]y` matches `match:gray` or `match:grey`.
Square brackets allow only characters or character sets. Alternation allows any expressions. A regexp `pattern:A|B|C` means one of expressions `A`, `B` or `C`. Square brackets allow only characters or character sets. Alternation allows any expressions. A regexp `pattern:A|B|C` means one of expressions `A`, `B` or `C`.
@ -27,30 +27,41 @@ For instance:
- `pattern:gr(a|e)y` means exactly the same as `pattern:gr[ae]y`. - `pattern:gr(a|e)y` means exactly the same as `pattern:gr[ae]y`.
- `pattern:gra|ey` means `match:gra` or `match:ey`. - `pattern:gra|ey` means `match:gra` or `match:ey`.
To separate a part of the pattern for alternation we usually enclose it in parentheses, like this: `pattern:before(XXX|YYY)after`. To apply alternation to a chosen part of the pattern, we can enclose it in parentheses:
- `pattern:I love HTML|CSS` matches `match:I love HTML` or `match:CSS`.
- `pattern:I love (HTML|CSS)` matches `match:I love HTML` or `match:I love CSS`.
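The two patterns can be compared on a sample string (the string is made up for this sketch):

```js
let loveStr = "I love HTML and I love CSS";

// Without parentheses the alternation splits the whole pattern
let unscoped = loveStr.match(/I love HTML|CSS/g);
console.log(unscoped); // I love HTML, CSS

// With parentheses only HTML|CSS is alternated
let scoped = loveStr.match(/I love (HTML|CSS)/g);
console.log(scoped); // I love HTML, I love CSS
```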
## Regexp for time ## Example: regexp for time
In previous chapters there was a task to build a regexp for searching time in the form `hh:mm`, for instance `12:00`. But a simple `pattern:\d\d:\d\d` is too vague. It accepts `25:99` as the time (as 99 seconds match the pattern). In previous articles there was a task to build a regexp for searching time in the form `hh:mm`, for instance `12:00`. But a simple `pattern:\d\d:\d\d` is too vague. It accepts `25:99` as the time (as 99 minutes match the pattern, but that time is invalid).
How can we make a better one? How can we make a better pattern?
We can apply more careful matching. First, the hours: We can use more careful matching. First, the hours:
- If the first digit is `0` or `1`, then the next digit can by anything. - If the first digit is `0` or `1`, then the next digit can be any: `pattern:[01]\d`.
- Or, if the first digit is `2`, then the next must be `pattern:[0-3]`. - Otherwise, if the first digit is `2`, then the next must be `pattern:[0-3]`.
- (no other first digit is allowed)
As a regexp: `pattern:[01]\d|2[0-3]`. We can write both variants in a regexp using alternation: `pattern:[01]\d|2[0-3]`.
Next, the minutes must be from `0` to `59`. In the regexp language that means `pattern:[0-5]\d`: the first digit `0-5`, and then any digit. Next, minutes must be from `00` to `59`. In the regular expression language that can be written as `pattern:[0-5]\d`: the first digit `0-5`, and then any digit.
Let's glue them together into the pattern: `pattern:[01]\d|2[0-3]:[0-5]\d`. If we glue hours and minutes together, we get the pattern: `pattern:[01]\d|2[0-3]:[0-5]\d`.
We're almost done, but there's a problem. The alternation `pattern:|` now happens to be between `pattern:[01]\d` and `pattern:2[0-3]:[0-5]\d`. We're almost done, but there's a problem. The alternation `pattern:|` now happens to be between `pattern:[01]\d` and `pattern:2[0-3]:[0-5]\d`.
That's wrong, as it should be applied only to hours `[01]\d` OR `2[0-3]`. That's a common mistake when starting to work with regular expressions. That is: minutes are added to the second alternation variant, here's a clear picture:
The correct variant: ```
[01]\d | 2[0-3]:[0-5]\d
```
That pattern looks for `pattern:[01]\d` or `pattern:2[0-3]:[0-5]\d`.
But that's wrong, the alternation should only be used in the "hours" part of the regular expression, to allow `pattern:[01]\d` OR `pattern:2[0-3]`. Let's correct that by enclosing "hours" into parentheses: `pattern:([01]\d|2[0-3]):[0-5]\d`.
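To see the mistake in action, here is a small sketch that runs the pattern without parentheses on a sample string (the string is an assumption made for this illustration). Bare hours are matched on their own, because minutes belong only to the second alternative:

```js
let times = "00:00 10:10 23:59 25:99 1:2";

// Alternation splits the whole pattern: [01]\d  OR  2[0-3]:[0-5]\d
let wrongTime = /[01]\d|2[0-3]:[0-5]\d/g;

let wrongMatches = times.match(wrongTime);
console.log(wrongMatches); // 00, 00, 10, 10, 23:59
```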
The final solution:
```js run ```js run
let reg = /([01]\d|2[0-3]):[0-5]\d/g; let reg = /([01]\d|2[0-3]):[0-5]\d/g;


@ -0,0 +1,29 @@
In order to insert after the `<body>` tag, we must first find it. We can use the regular expression `pattern:<body.*>` for that.
Next, we need to leave the `<body>` tag itself in place and add the text after it.
Here's how we can do it:
```js run
let str = '...<body style="...">...';
str = str.replace(/<body.*>/, '$&<h1>Hello</h1>');
alert(str); // ...<body style="..."><h1>Hello</h1>...
```
In the replacement string `$&` means the match itself, that is, `pattern:<body.*>` is replaced with itself plus `<h1>Hello</h1>`.
An alternative is to use lookbehind:
```js run
let str = '...<body style="...">...';
str = str.replace(/(?<=<body.*>)/, `<h1>Hello</h1>`);
alert(str); // ...<body style="..."><h1>Hello</h1>...
```
Such a regular expression checks at every position whether `pattern:<body.*>` comes immediately before it. If yes, a match is found. But the tag `pattern:<body.*>` itself is not included in the match, it only takes part in the check. And there are no other characters after the check, so the matched text is empty.
So the "empty string" preceded by `pattern:<body.*>` is replaced with `<h1>Hello</h1>`. That is exactly the insertion of this string after `<body>`.
P.S. Flags wouldn't hurt this regular expression: `pattern:/<body.*>/si`, so that the "dot" includes a newline (a tag may span several lines), and so that tags in a different case like `match:<BODY>` are found too.


@ -0,0 +1,30 @@
# Insert after a fragment
There's a string with an HTML document.
Insert the string `<h1>Hello</h1>` after the `<body>` tag (it may have attributes).
For instance:
```js
let reg = /your regular expression/;
let str = `
<html>
<body style="height: 200px">
...
</body>
</html>
`;
str = str.replace(reg, `<h1>Hello</h1>`);
```
After that the value of `str`:
```html
<html>
<body style="height: 200px"><h1>Hello</h1>
...
</body>
</html>
```


@ -1,54 +1,82 @@
# Lookahead and lookbehind # Lookahead and lookbehind
Sometimes we need to match a pattern only if followed by another pattern. For instance, we'd like to get the price from a string like `subject:1 turkey costs 30€`. Sometimes we need to find only those matches for a pattern that are followed or preceded by another pattern.
We need a number (let's say a price has no decimal point) followed by `subject:€` sign. There's a special syntax for that, called "lookahead" and "lookbehind", together referred to as "lookaround".
That's what lookahead is for. For the start, let's find the price from the string like `subject:1 turkey costs 30€`. That is: a number, followed by `subject:€` sign.
## Lookahead ## Lookahead
The syntax is: `pattern:x(?=y)`, it means "look for `pattern:x`, but match only if followed by `pattern:y`". The syntax is: `pattern:X(?=Y)`, it means "look for `pattern:X`, but match only if followed by `pattern:Y`". There may be any pattern instead of `pattern:X` and `pattern:Y`.
For an integer amount followed by `subject:€`, the regexp will be `pattern:\d+(?=€)`: For an integer number followed by `subject:€`, the regexp will be `pattern:\d+(?=€)`:
```js run ```js run
let str = "1 turkey costs 30€"; let str = "1 turkey costs 30€";
alert( str.match(/\d+(?=€)/) ); // 30 (correctly skipped the sole number 1) alert( str.match(/\d+(?=€)/) ); // 30, the number 1 is ignored, as it's not followed by €
``` ```
Let's say we want a quantity instead, that is a number, NOT followed by `subject:€`. Please note: the lookahead is merely a test, the contents of the parentheses `pattern:(?=...)` is not included in the result `match:30`.
Here a negative lookahead can be applied. When we look for `pattern:X(?=Y)`, the regular expression engine finds `pattern:X` and then checks if there's `pattern:Y` immediately after it. If it's not so, then the potential match is skipped, and the search continues.
The syntax is: `pattern:x(?!y)`, it means "search `pattern:x`, but only if not followed by `pattern:y`". More complex tests are possible, e.g. `pattern:X(?=Y)(?=Z)` means:
1. Find `pattern:X`.
2. Check if `pattern:Y` is immediately after `pattern:X` (skip if isn't).
3. Check if `pattern:Z` is immediately after `pattern:X` (skip if isn't).
4. If both tests passed, then it's the match.
In other words, such pattern means that we're looking for `pattern:X` followed by `pattern:Y` and `pattern:Z` at the same time.
That's only possible if patterns `pattern:Y` and `pattern:Z` aren't mutually exclusive.
For example, `pattern:\d+(?=\s)(?=.*30)` looks for `pattern:\d+` only if it's followed by a space, and there's `30` somewhere after it:
```js run
let str = "1 turkey costs 30€";
alert( str.match(/\d+(?=\s)(?=.*30)/) ); // 1
```
In our string that exactly matches the number `1`.
## Negative lookahead
Let's say that we want a quantity instead, not a price from the same string. That's a number `pattern:\d+`, NOT followed by `subject:€`.
For that, a negative lookahead can be applied.
The syntax is: `pattern:X(?!Y)`, it means "search `pattern:X`, but only if not followed by `pattern:Y`".
```js run ```js run
let str = "2 turkeys cost 60€"; let str = "2 turkeys cost 60€";
alert( str.match(/\d+(?!€)/) ); // 2 (correctly skipped the price) alert( str.match(/\d+(?!€)/) ); // 2 (the price is skipped)
``` ```
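One detail is worth checking with the flag `pattern:g`. The sketch below (sample data as above, variable names are illustrative) shows that `pattern:\d+(?!€)` also finds `match:6`: the engine backtracks from `60` to `6`, and `6` is followed by `0`, not by `subject:€`. Forbidding a digit after the number as well, `pattern:\d+(?![\d€])`, is one possible way to avoid that.

```js
let turkeys = "2 turkeys cost 60€";

// 6 is matched too: it's followed by 0, not by €
let loose = turkeys.match(/\d+(?!€)/g);
console.log(loose); // 2, 6

// Neither a digit nor € may follow, so only whole numbers are considered
let strict = turkeys.match(/\d+(?![\d€])/g);
console.log(strict); // 2
```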
## Lookbehind ## Lookbehind
Lookahead allows to add a condition for "what goes after". Lookahead allows to add a condition for "what follows".
Lookbehind is similar, but it looks behind. That is, it allows to match a pattern only if there's something before. Lookbehind is similar, but it looks behind. That is, it allows to match a pattern only if there's something before it.
The syntax is: The syntax is:
- Positive lookbehind: `pattern:(?<=y)x`, matches `pattern:x`, but only if it follows after `pattern:y`. - Positive lookbehind: `pattern:(?<=Y)X`, matches `pattern:X`, but only if there's `pattern:Y` before it.
- Negative lookbehind: `pattern:(?<!y)x`, matches `pattern:x`, but only if there's no `pattern:y` before. - Negative lookbehind: `pattern:(?<!Y)X`, matches `pattern:X`, but only if there's no `pattern:Y` before it.
For example, let's change the price to US dollars. The dollar sign is usually before the number, so to look for `$30` we'll use `pattern:(?<=\$)\d+` -- an amount preceded by `subject:$`: For example, let's change the price to US dollars. The dollar sign is usually before the number, so to look for `$30` we'll use `pattern:(?<=\$)\d+` -- an amount preceded by `subject:$`:
```js run ```js run
let str = "1 turkey costs $30"; let str = "1 turkey costs $30";
// the dollar sign is escaped \$
alert( str.match(/(?<=\$)\d+/) ); // 30 (skipped the sole number) alert( str.match(/(?<=\$)\d+/) ); // 30 (skipped the sole number)
``` ```
And, to find the quantity -- a number, not preceded by `subject:$`, we can use a negative lookbehind `pattern:(?<!\$)\d+`: And, if we need the quantity -- a number, not preceded by `subject:$`, then we can use a negative lookbehind `pattern:(?<!\$)\d+`:
```js run ```js run
let str = "2 turkeys cost $60"; let str = "2 turkeys cost $60";
@ -56,15 +84,15 @@ let str = "2 turkeys cost $60";
alert( str.match(/(?<!\$)\d+/) ); // 2 (skipped the price) alert( str.match(/(?<!\$)\d+/) ); // 2 (skipped the price)
``` ```
## Capture groups ## Capturing groups
Generally, what's inside the lookaround (a common name for both lookahead and lookbehind) parentheses does not become a part of the match. Generally, the contents inside lookaround parentheses does not become a part of the result.
E.g. in the pattern `pattern:\d+(?=€)`, the `pattern:€` sign doesn't get captured as a part of the match. That's natural: we look for a number `pattern:\d+`, while `pattern:(?=€)` is just a test that it should be followed by `subject:€`. E.g. in the pattern `pattern:\d+(?=€)`, the `pattern:€` sign doesn't get captured as a part of the match. That's natural: we look for a number `pattern:\d+`, while `pattern:(?=€)` is just a test that it should be followed by `subject:€`.
But in some situations we might want to capture the lookaround expression as well, or a part of it. That's possible. Just wrap that into additional parentheses. But in some situations we might want to capture the lookaround expression as well, or a part of it. That's possible. Just wrap that part into additional parentheses.
For instance, here the currency `pattern:(€|kr)` is captured, along with the amount: In the example below the currency sign `pattern:(€|kr)` is captured, along with the amount:
```js run ```js run
let str = "1 turkey costs 30€"; let str = "1 turkey costs 30€";
@ -82,28 +110,21 @@ let reg = /(?<=(\$|£))\d+/;
alert( str.match(reg) ); // 30, $ alert( str.match(reg) ); // 30, $
``` ```
Please note that for lookbehind the order stays the same, even though lookbehind parentheses are before the main pattern.
Usually parentheses are numbered left-to-right, but lookbehind is an exception, it is always captured after the main pattern. So the match for `pattern:\d+` goes in the result first, and then for `pattern:(\$|£)`.
## Summary ## Summary
Lookahead and lookbehind (commonly referred to as "lookaround") are useful when we'd like to take something into the match depending on the context before/after it. Lookahead and lookbehind (commonly referred to as "lookaround") are useful when we'd like to match something depending on the context before/after it.
For simple regexps we can do the similar thing manually. That is: match everything, in any context, and then filter by context in the loop. For simple regexps we can do the similar thing manually. That is: match everything, in any context, and then filter by context in the loop.
Remember, `str.matchAll` and `reg.exec` return matches with `.index` property, so we know where exactly in the text it is, and can check the context. Remember, `str.match` (without flag `pattern:g`) and `str.matchAll` (always) return matches as arrays with `index` property, so we know where exactly in the text it is, and can check the context.
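A minimal sketch of such manual filtering, assuming we want the numbers followed by `subject:€` (the same result as `pattern:\d+(?=€)`); the variable names are only for illustration:

```js
let priceStr = "1 turkey costs 30€";
let prices = [];

// Match every number, then check the context by hand
for (let match of priceStr.matchAll(/\d+/g)) {
  // the character right after the match
  let next = priceStr[match.index + match[0].length];
  if (next === "€") prices.push(match[0]);
}

console.log(prices); // 30
```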
But generally regular expressions are more convenient. But generally lookaround is more convenient.
Lookaround types: Lookaround types:
| Pattern | type | matches | | Pattern | type | matches |
|--------------------|------------------|---------| |--------------------|------------------|---------|
| `pattern:x(?=y)` | Positive lookahead | `x` if followed by `pattern:y` | | `X(?=Y)` | Positive lookahead | `pattern:X` if followed by `pattern:Y` |
| `pattern:x(?!y)` | Negative lookahead | `x` if not followed by `pattern:y` | | `X(?!Y)` | Negative lookahead | `pattern:X` if not followed by `pattern:Y` |
| `pattern:(?<=y)x` | Positive lookbehind | `x` if after `pattern:y` | | `(?<=Y)X` | Positive lookbehind | `pattern:X` if after `pattern:Y` |
| `pattern:(?<!y)x` | Negative lookbehind | `x` if not after `pattern:y` | | `(?<!Y)X` | Negative lookbehind | `pattern:X` if not after `pattern:Y` |
Lookahead can also be used to disable backtracking. Why that may be needed and other details -- see in the next chapter.


@ -0,0 +1,301 @@
# Catastrophic backtracking
Some regular expressions look simple, but can take a veeeeeery long time to execute, and even "hang" the JavaScript engine.
Sooner or later most developers face such behavior, because it's quite easy to create such a regexp.
The typical symptom -- a regular expression works fine sometimes, but for certain strings it "hangs", consuming 100% of CPU.
In such a case a web browser suggests killing the script and reloading the page. Not a good thing for sure.
For server-side JavaScript it may become a vulnerability if regular expressions process user data.
## Example
Let's say we have a string, and we'd like to check if it consists of words `pattern:\w+` with an optional space `pattern:\s?` after each.
We'll use a regexp `pattern:^(\w+\s?)*$`, it specifies 0 or more such words.
In action:
```js run
let reg = /^(\w+\s?)*$/;
alert( reg.test("A good string") ); // true
alert( reg.test("Bad characters: $@#") ); // false
```
It seems to work. The result is correct. However, on certain strings it takes a lot of time. So long that the JavaScript engine "hangs" with 100% CPU consumption.
If you run the example below, you probably won't see anything, as JavaScript will just "hang". A web browser will stop reacting to events, the UI will stop working. After some time it will suggest reloading the page. So be careful with this:
```js run
let reg = /^(\w+\s?)*$/;
let str = "An input string that takes a long time or even makes this regexp to hang!";
// will take a very long time
alert( reg.test(str) );
```
Some regular expression engines can handle such a search, but most of them can't.
## Simplified example
What's the matter? Why does the regular expression "hang"?
To understand that, let's simplify the example: remove spaces `pattern:\s?`. Then it becomes `pattern:^(\w+)*$`.
And, to make things more obvious, let's replace `pattern:\w` with `pattern:\d`. The resulting regular expression still hangs, for instance:
<!-- let str = `AnInputStringThatMakesItHang!`; -->
```js run
let reg = /^(\d+)*$/;
let str = "012345678901234567890123456789!";
// will take a very long time
alert( reg.test(str) );
```
So what's wrong with the regexp?
First, one may notice that the regexp `pattern:(\d+)*` is a little bit strange. The quantifier `pattern:*` looks extraneous. If we want a number, we can use `pattern:\d+`.
Indeed, the regexp is artificial. But the reason why it is slow is the same as in the example above. So let's understand it, and then the previous example will become obvious.
What happens during the search of `pattern:^(\d+)*$` in the line `subject:123456789!` (shortened a bit for clarity), why does it take so long?
1. First, the regexp engine tries to find a number `pattern:\d+`. The plus `pattern:+` is greedy by default, so it consumes all digits:
```
\d+........
(123456789)!
```
Then it tries to apply the star quantifier, but there are no more digits, so the star doesn't give anything.
The next in the pattern is the string end `pattern:$`, but in the text we have `subject:!`, so there's no match:
```
           X
\d+........$
(123456789)!
```
2. As there's no match, the greedy quantifier `pattern:+` decreases the count of repetitions, backtracks one character back.
Now `pattern:\d+` takes all digits except the last one:
```
\d+.......
(12345678)9!
```
3. Then the engine tries to continue the search from the new position (`9`).
The star `pattern:(\d+)*` can be applied -- it gives the number `match:9`:
```
\d+.......\d+
(12345678)(9)!
```
The engine tries to match `pattern:$` again, but fails, because it meets `subject:!`:
```
             X
\d+.......\d+
(12345678)(9)!
```
4. There's no match, so the engine will continue backtracking, decreasing the number of repetitions. Backtracking generally works like this: the last greedy quantifier decreases the number of repetitions for as long as it can. Then the previous greedy quantifier decreases, and so on.
All possible combinations are attempted. Here are some examples.
The first number `pattern:\d+` has 7 digits, and then a number of 2 digits:
```
             X
\d+......\d+
(1234567)(89)!
```
The first number has 7 digits, and then two numbers of 1 digit each:
```
               X
\d+......\d+\d+
(1234567)(8)(9)!
```
The first number has 6 digits, and then a number of 3 digits:
```
             X
\d+.....\d+
(123456)(789)!
```
The first number has 6 digits, and then 2 numbers:
```
               X
\d+.....\d+ \d+
(123456)(78)(9)!
```
...And so on.
There are many ways to split a set of digits `123456789` into numbers. To be precise, there are <code>2<sup>n</sup>-1</code>, where `n` is the length of the set.
For `n=20` there are about 1 million combinations, for `n=30` - a thousand times more. Trying each of them is exactly the reason why the search takes so long.
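The count can be checked with a small brute-force sketch. It is only a model of the search space, not the engine's actual algorithm: the hypothetical helper `countWays` counts every way to cover some non-empty prefix of `n` digits with consecutive non-empty numbers.

```js
// Count the ways to cover a non-empty prefix of n digits
// with one or more consecutive non-empty numbers
function countWays(n) {
  let total = 0;
  for (let first = 1; first <= n; first++) {
    // take `first` digits as a number, then either stop here (1 way)
    // or continue with more numbers in the remaining n - first digits
    total += 1 + countWays(n - first);
  }
  return total;
}

console.log(countWays(3)); // 7
console.log(countWays(9)); // 511
console.log(countWays(9) === 2 ** 9 - 1); // true
```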
What to do?
Should we turn on the lazy mode?
Unfortunately, that won't help: if we replace `pattern:\d+` with `pattern:\d+?`, the regexp will still hang. The order of combinations will change, but not their total count.
Some regular expression engines have tricky tests and finite automata that allow them to avoid going through all combinations or make it much faster, but not all engines, and not in all cases.
## Back to words and strings
A similar thing happens in our first example, when we look for words by the pattern `pattern:^(\w+\s?)*$` in the string `subject:An input that hangs!`.
The reason is that a word can be represented as one `pattern:\w+` or many:
```
(input)
(inpu)(t)
(inp)(u)(t)
(in)(p)(ut)
...
```
For a human, it's obvious that there can be no match, because the string ends with an exclamation sign `!`, but the regular expression expects a word character `pattern:\w` or a space `pattern:\s` at the end. But the engine doesn't know that.
It tries all combinations of how the regexp `pattern:(\w+\s?)*` can "consume" the string, including variants with spaces `pattern:(\w+\s)*` and without them `pattern:(\w+)*` (because spaces `pattern:\s?` are optional). As there are many such combinations, the search takes a lot of time.
## How to fix?
There are two main approaches to fixing the problem.
The first is to lower the number of possible combinations.
Let's rewrite the regular expression as `pattern:^(\w+\s)*\w*$` - we'll look for any number of words followed by a space `pattern:(\w+\s)*`, and then (optionally) a word `pattern:\w*`.
This regexp is equivalent to the previous one (matches the same) and works well:
```js run
let reg = /^(\w+\s)*\w*$/;
let str = "An input string that takes a long time or even makes this regex to hang!";
alert( reg.test(str) ); // false
```
Why did the problem disappear?
Now the star `pattern:*` goes after `pattern:\w+\s` instead of `pattern:\w+\s?`. It became impossible to represent one word of the string with multiple successive `pattern:\w+`. The time needed to try such combinations is now saved.
For example, the previous pattern `pattern:(\w+\s?)*` could match the word `subject:string` as two `pattern:\w+`:
```
\w+\w+
string
```
The previous pattern, due to the optional `pattern:\s` allowed variants `pattern:\w+`, `pattern:\w+\s`, `pattern:\w+\w+` and so on.
With the rewritten pattern `pattern:(\w+\s)*`, that's impossible: there may be `pattern:\w+\s` or `pattern:\w+\s\w+\s`, but not `pattern:\w+\w+`. So the overall combinations count is greatly decreased.
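As a quick sanity check, the sketch below compares both patterns on a few short strings (the samples are arbitrary and kept short on purpose, so that the original pattern doesn't hang). It's not a proof of equivalence, just an illustration that the answers agree:

```js
let original = /^(\w+\s?)*$/;
let rewritten = /^(\w+\s)*\w*$/;

let samples = ["", "word", "word ", "A good string", " leading space", "two  spaces", "Bad: $@#"];

// Collect [string, result of original, result of rewritten]
let comparison = samples.map(s => [s, original.test(s), rewritten.test(s)]);

for (let [s, a, b] of comparison) {
  console.log(JSON.stringify(s), a, b); // both results are the same for each sample
}
```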
## Preventing backtracking
It's not always convenient to rewrite a regexp. And it's not always obvious how to do it.
The alternative approach is to forbid backtracking for the quantifier.
The regular expressions engine tries many combinations that are obviously wrong for a human.
E.g. in the regexp `pattern:(\d+)*$` it's obvious for a human, that `pattern:+` shouldn't backtrack. If we replace one `pattern:\d+` with two separate `pattern:\d+\d+`, nothing changes:
```
\d+........
(123456789)!
\d+...\d+....
(1234)(56789)!
```
And in the original example `pattern:^(\w+\s?)*$` we may want to forbid backtracking in `pattern:\w+`. That is: `pattern:\w+` should match a whole word, with the maximal possible length. There's no need to lower the repetitions count in `pattern:\w+`, try to split it into two words `pattern:\w+\w+` and so on.
Modern regular expression engines support possessive quantifiers for that. They are like greedy ones, but don't backtrack (so they are actually simpler than regular quantifiers).
There are also so-called "atomic groups" - a way to disable backtracking inside parentheses.
Unfortunately, in JavaScript they are not supported. But there's another way.
### Lookahead to the rescue!
We can prevent backtracking using lookahead.
The pattern to take as many repetitions of `pattern:\w` as possible without backtracking is: `pattern:(?=(\w+))\1`.
Let's decipher it:
- Lookahead `pattern:?=` looks forward for the longest word `pattern:\w+` starting at the current position.
- The contents of parentheses with `pattern:?=...` isn't memorized by the engine, so wrap `pattern:\w+` into parentheses. Then the engine will memorize their contents
- ...And allow us to reference it in the pattern as `pattern:\1`.
That is: we look ahead - and if there's a word `pattern:\w+`, then match it as `pattern:\1`.
Why? That's because the lookahead finds a word `pattern:\w+` as a whole and we capture it into the pattern with `pattern:\1`. So we essentially implemented a possessive plus `pattern:+` quantifier. It captures only the whole word `pattern:\w+`, not a part of it.
For instance, in the word `subject:JavaScript` it can't match only `match:Java` and leave out `match:Script` to match the rest of the pattern.
Here's the comparison of two patterns:
```js run
alert( "JavaScript".match(/\w+Script/)); // JavaScript
alert( "JavaScript".match(/(?=(\w+))\1Script/)); // null
```
1. In the first variant `pattern:\w+` first captures the whole word `subject:JavaScript` but then `pattern:+` backtracks character by character, to try to match the rest of the pattern, until it finally succeeds (when `pattern:\w+` matches `match:Java`).
2. In the second variant `pattern:(?=(\w+))` looks ahead and finds the word `subject:JavaScript`, that is included into the pattern as a whole by `pattern:\1`, so there remains no way to find `subject:Script` after it.
We can put a more complex regular expression into `pattern:(?=(\w+))\1` instead of `pattern:\w`, when we need to forbid backtracking for `pattern:+` after it.
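As a rough sketch of that idea, the wrapping can be done by a small helper. The function name `possessive` and its parameters are hypothetical, introduced here only for illustration. It uses a named group so that the backreference doesn't depend on the group number:

```js
// Hypothetical helper: wraps a pattern source into (?=(?<name>...))\k<name>,
// so that the wrapped part is matched as a whole, without backtracking.
function possessive(source, name) {
  return `(?=(?<${name}>${source}))\\k<${name}>`;
}

// Ordinary greedy \d+ gives one digit back, so the final \d can match
let greedy = /\d+\d/;

// The wrapped \d+ keeps all digits, nothing is left for the final \d
let strict = new RegExp(possessive("\\d+", "digits") + "\\d");

console.log( "12345".match(greedy)[0] ); // 12345
console.log( "12345".match(strict) );    // null
```

Note that the wrapper adds one more capturing group, so the numbers of any groups that come after it are shifted.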
```smart
There's more about the relation between possessive quantifiers and lookahead in articles [Regex: Emulate Atomic Grouping (and Possessive Quantifiers) with LookAhead](http://instanceof.me/post/52245507631/regex-emulate-atomic-grouping-with-lookahead) and [Mimicking Atomic Groups](http://blog.stevenlevithan.com/archives/mimic-atomic-groups).
```
Let's rewrite the first example using lookahead to prevent backtracking:
```js run
let reg = /^((?=(\w+))\2\s?)*$/;
alert( reg.test("A good string") ); // true
let str = "An input string that takes a long time or even makes this regex to hang!";
alert( reg.test(str) ); // false, works and fast!
```
Here `pattern:\2` is used instead of `pattern:\1`, because there are additional outer parentheses. To avoid messing up with the numbers, we can give the parentheses a name, e.g. `pattern:(?<word>\w+)`.
```js run
// parentheses are named ?<word>, referenced as \k<word>
let reg = /^((?=(?<word>\w+))\k<word>\s?)*$/;
let str = "An input string that takes a long time or even makes this regex to hang!";
alert( reg.test(str) ); // false
alert( reg.test("A correct string") ); // true
```
The problem described in this article is called "catastrophic backtracking".
We covered two ways to solve it:
- Rewrite the regexp to lower the possible combinations count.
- Prevent backtracking.
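Both approaches can be put side by side in a short sketch. The rewritten pattern `pattern:^(\w+\s)*\w*$` is just one possible rewrite, assumed here for illustration; the lookahead variant is the one shown above. The original pattern is deliberately not run on the long string, because it may hang:

```js
// One possible rewrite (an assumption): the space after a word is mandatory,
// and an optional last word comes at the end
let rewritten = /^(\w+\s)*\w*$/;

// Backtracking inside \w+ is prevented by the lookahead
let noBacktrack = /^((?=(\w+))\2\s?)*$/;

let good = "A good string";
let bad = "An input string that takes a long time or even makes this regex to hang!";

console.log( rewritten.test(good), noBacktrack.test(good) ); // true true
console.log( rewritten.test(bad), noBacktrack.test(bad) );   // false false
```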

View file

@ -1,297 +0,0 @@
# Infinite backtracking problem
Some regular expressions look simple, but can take a veeeeeery long time to execute, and even "hang" the JavaScript engine.
Sooner or later most developers face such behavior.
The typical situation -- a regular expression works fine sometimes, but for certain strings it "hangs", consuming 100% of CPU.
In a web-browser it kills the page. Not a good thing for sure.
For server-side JavaScript it may become a vulnerability if it uses regular expressions to process user data. Bad input will make the process hang, causing denial of service. The author personally saw and reported such vulnerabilities even for very well-known and widely used programs.
So the problem is definitely worth dealing with.
## Introduction
The plan will be like this:
1. First we see how the problem may occur.
2. Then we simplify the situation and see why it occurs.
3. Then we fix it.
For instance let's consider searching tags in HTML.
We want to find all tags, with or without attributes -- like `subject:<a href="..." class="doc" ...>`. We need the regexp to work reliably, because HTML comes from the internet and can be messy.
In particular, we need it to match tags like `<a test="<>" href="#">` -- with `<` and `>` in attributes. That's allowed by [HTML standard](https://html.spec.whatwg.org/multipage/syntax.html#syntax-attributes).
A simple regexp like `pattern:<[^>]+>` doesn't work, because it stops at the first `>`, and we need to ignore `<>` if inside an attribute:
```js run
// the match doesn't reach the end of the tag - wrong!
alert( '<a test="<>" href="#">'.match(/<[^>]+>/) ); // <a test="<>
```
To correctly handle such situations we need a more complex regular expression. It will have the form `pattern:<tag (key=value)*>`.
1. For the `tag` name: `pattern:\w+`,
2. For the `key` name: `pattern:\w+`,
3. And the `value`: a quoted string `pattern:"[^"]*"`.
If we substitute these into the pattern above and throw in some optional spaces `pattern:\s`, the full regexp becomes: `pattern:<\w+(\s*\w+="[^"]*"\s*)*>`.
That regexp is not perfect! It doesn't support all the details of HTML syntax, such as unquoted values, and there are other ways to improve, but let's not add complexity. It will demonstrate the problem for us.
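The substitution described above can be sketched by assembling the pattern from its parts. The variable names `tag`, `key` and `value` are only illustrative:

```js
// Parts of the pattern as strings (backslashes are doubled inside strings)
let tag = "\\w+";
let key = "\\w+";
let value = '"[^"]*"';

// <tag (key=value)*> with optional spaces around each pair
let assembled = new RegExp(`<${tag}(\\s*${key}=${value}\\s*)*>`, "g");

// The result has the same source as the regexp written by hand
console.log( assembled.source === /<\w+(\s*\w+="[^"]*"\s*)*>/.source ); // true
```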
The regexp seems to work:
```js run
let reg = /<\w+(\s*\w+="[^"]*"\s*)*>/g;
let str='...<a test="<>" href="#">... <b>...';
alert( str.match(reg) ); // <a test="<>" href="#">, <b>
```
Great! It found both the long tag `match:<a test="<>" href="#">` and the short one `match:<b>`.
Now that we've got a seemingly working solution, let's get to the infinite backtracking itself.
## Infinite backtracking
If you run our regexp on the input below, it may hang the browser (or another JavaScript host):
```js run
let reg = /<\w+(\s*\w+="[^"]*"\s*)*>/g;
let str = `<tag a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b"
a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b"`;
*!*
// The search will take a long, long time
alert( str.match(reg) );
*/!*
```
Some regexp engines can handle that search, but most of them can't.
What's the matter? Why does a simple regular expression "hang" on such a small string?
Let's simplify the regexp by stripping the tag name and the quotes, so that we look only for `key=value` attributes: `pattern:<(\s*\w+=\w+\s*)*>`.
Unfortunately, the regexp still hangs:
```js run
// only search for space-delimited attributes
let reg = /<(\s*\w+=\w+\s*)*>/g;
let str = `<a=b a=b a=b a=b a=b a=b a=b a=b
a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b`;
*!*
// the search will take a long, long time
alert( str.match(reg) );
*/!*
```
Here we end the demo of the problem and start looking into what's going on, why it hangs and how to fix it.
## Detailed example
To make the example even simpler, let's consider `pattern:(\d+)*$`.
This regular expression has the same problem. In most regexp engines that search takes a very long time (careful -- can hang):
```js run
alert( '12345678901234567890123456789123456789z'.match(/(\d+)*$/) );
```
So what's wrong with the regexp?
First, one may notice that the regexp is a little bit strange. The quantifier `pattern:*` looks extraneous. If we want a number, we can use `pattern:\d+$`.
Indeed, the regexp is artificial. But the reason why it is slow is the same as those we saw above. So let's understand it, and then the previous example will become obvious.
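For comparison, here's a small sketch with the simpler pattern `pattern:\d+$` mentioned above. It has only one quantifier, so there are few ways to backtrack, and the same long string should be handled quickly:

```js
let simple = /\d+$/;

// A string of digits matches as a whole
console.log( "123456789".match(simple)[0] ); // 123456789

// The string ends with z, so there's no match, and the answer comes fast
console.log( "12345678901234567890123456789123456789z".match(simple) ); // null
```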
What happens during the search of `pattern:(\d+)*$` in the string `subject:123456789z`?
1. First, the regexp engine tries to find a number `pattern:\d+`. The plus `pattern:+` is greedy by default, so it consumes all digits:
```
\d+.......
(123456789)z
```
2. Then it tries to apply the star quantifier, but there are no more digits, so the star doesn't give anything.
3. Then the pattern expects to see the string end `pattern:$`, and in the text we have `subject:z`, so there's no match:
```
X
\d+........$
(123456789)z
```
4. As there's no match, the greedy quantifier `pattern:+` decreases the count of repetitions (backtracks).
Now `\d+` doesn't take all digits, but all except the last one:
```
\d+.......
(12345678)9z
```
5. Now the engine tries to continue the search from the new position (`9`).
The star `pattern:(\d+)*` can be applied -- it gives the number `match:9`:
```
\d+.......\d+
(12345678)(9)z
```
The engine tries to match `pattern:$` again, but fails, because it meets `subject:z`:
```
X
\d+.......\d+
(12345678)(9)z
```
6. There's no match, so the engine will continue backtracking, decreasing the number of repetitions for the first `pattern:\d+` down to 7 digits. So the rest of the string `subject:89` becomes the second `pattern:\d+`:
```
X
\d+......\d+
(1234567)(89)z
```
...Still no match for `pattern:$`.
The search engine backtracks again. Backtracking generally works like this: the last greedy quantifier decreases the number of repetitions as long as it can. Then the previous greedy quantifier decreases, and so on. In our case the last greedy quantifier is the second `pattern:\d+`: it shrinks from `subject:89` to `subject:8`, and then the star takes `subject:9`:
```
X
\d+......\d+\d+
(1234567)(8)(9)z
```
7. ...Fail again. The second and third `pattern:\d+` backtracked to the end, so the first quantifier shortens the match to `subject:123456`, and the star takes the rest:
```
X
\d+.....\d+
(123456)(789)z
```
Again no match. The process repeats: the last greedy quantifier releases one character (`9`):
```
X
\d+.....\d+ \d+
(123456)(78)(9)z
```
8. ...And so on.
The regular expression engine goes through all the ways to split `subject:123456789` into numbers. The number of such splits grows exponentially with the number of digits, which is why the search takes so long.
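To get a feeling for the numbers, here's a rough sketch that lists the ways to split a string of digits into consecutive parts. The function `splits` is hypothetical and only models the combinations; a real engine also tries partial variants, so it does even more work:

```js
// Returns all ways to split the string into consecutive non-empty parts
function splits(digits) {
  if (digits.length === 0) return [[]];
  let result = [];
  for (let i = 1; i <= digits.length; i++) {
    let head = digits.slice(0, i); // the first number
    for (let rest of splits(digits.slice(i))) { // all splits of the remainder
      result.push([head, ...rest]);
    }
  }
  return result;
}

console.log( splits("123") ); // [1,2,3], [1,23], [12,3], [123]
console.log( splits("123456789").length ); // 256
```

Each extra digit doubles the count, so for the 38 digits in the example above it becomes astronomical.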
What to do?
Should we turn on the lazy mode?
Unfortunately, that doesn't help: if we replace `pattern:\d+` with `pattern:\d+?`, that still hangs:
```js run
// sloooooowwwwww
alert( '12345678901234567890123456789123456789z'.match(/(\d+?)*$/) );
```
Lazy quantifiers actually do the same, but in the reverse order.
Just think about how the search engine would work in this case.
Some regular expression engines have tricky built-in checks to detect infinite backtracking or other means to work around it, but there's no universal solution.
## Back to tags
In the example above, when we search `pattern:<(\s*\w+=\w+\s*)*>` in the string `subject:<a=b a=b a=b a=b` -- a similar thing happens.
The string has no `>` at the end, so the match is impossible, but the regexp engine doesn't know about it. The search backtracks trying different combinations of `pattern:(\s*\w+=\w+\s*)`:
```
(a=b a=b a=b) (a=b)
(a=b a=b) (a=b a=b)
(a=b) (a=b a=b a=b)
...
```
As there are many combinations, it takes a lot of time.
## How to fix?
The backtracking checks many variants that are an obvious fail for a human.
For instance, in the pattern `pattern:(\d+)*$` a human can easily see that `pattern:(\d+)*` does not need to backtrack `pattern:+`. There's no difference between one or two `\d+`:
```
\d+........
(123456789)z
\d+...\d+....
(1234)(56789)z
```
Let's get back to a more real-life example: `pattern:<(\s*\w+=\w+\s*)*>`. We want it to find pairs `name=value` (as many as it can).
What we would like to do is to forbid backtracking.
There's totally no need to decrease the number of repetitions.
In other words, if the engine found three `name=value` pairs and then can't find `>` after them, then there's no need to decrease the count of repetitions. There's definitely no `>` after the first two pairs either (if we backtrack one `name=value` pair, that pair is what comes next):
```
(name=value) name=value
```
Modern regexp engines support so-called "possessive" quantifiers for that. They are like greedy ones, but don't backtrack at all. Pretty simple: they capture whatever they can, and the search continues. There's also another tool called "atomic groups" that forbids backtracking inside parentheses.
Unfortunately, neither of these features is supported by JavaScript.
### Lookahead to the rescue
We can forbid backtracking using lookahead.
The pattern to take as many repetitions as possible without backtracking is: `pattern:(?=(a+))\1`.
In other words:
- The lookahead `pattern:?=` looks for the maximal count `pattern:a+` from the current position.
- And then they are "consumed into the result" by the backreference `pattern:\1` (`pattern:\1` corresponds to the content of the capturing parentheses inside the lookahead, that is `pattern:a+`).
There will be no backtracking, because lookahead does not backtrack. If, for
example, `pattern:a+` found 5 repetitions of `pattern:a` and the further match failed,
it won't go back to 4 repetitions.
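A minimal sketch of the difference, on a tiny string chosen only for illustration:

```js
// Greedy a+ first takes all four "a", then gives one back for the following "a"
console.log( "aaaab".match(/a+ab/)[0] ); // aaaab

// With the lookahead trick all "a" are consumed by \1 and none is given back
console.log( "aaaab".match(/(?=(a+))\1ab/) ); // null
```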
```smart
There's more about the relation between possessive quantifiers and lookahead in articles [Regex: Emulate Atomic Grouping (and Possessive Quantifiers) with LookAhead](http://instanceof.me/post/52245507631/regex-emulate-atomic-grouping-with-lookahead) and [Mimicking Atomic Groups](http://blog.stevenlevithan.com/archives/mimic-atomic-groups).
```
So this trick makes the problem disappear.
Let's fix the regexp for a tag with attributes from the beginning of the chapter, this time also allowing unquoted values: `pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`. We'll use lookahead to prevent backtracking of `name=value` pairs:
```js run
// regexp to search name=value
let attrReg = /(\s*\w+=(\w+|"[^"]*")\s*)/;
// use new RegExp to nicely insert its source into (?=(a+))\1
let fixedReg = new RegExp(`<\\w+(?=(${attrReg.source}*))\\1>`, 'g');
let goodInput = '...<a test="<>" href="#">... <b>...';
let badInput = `<tag a=b a=b a=b a=b a=b a=b a=b a=b
a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b`;
alert( goodInput.match(fixedReg) ); // <a test="<>" href="#">, <b>
alert( badInput.match(fixedReg) ); // null (no results, fast!)
```
Great, it works! We found both a long tag `match:<a test="<>" href="#">` and a small one `match:<b>`, and (!) didn't hang the engine on the bad input.
Please note the `attrReg.source` property. `RegExp` objects provide access to their source string in it. That's convenient when we want to insert one regexp into another.
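Here's a short sketch of how `source` behaves. The names `number` and `range` are made up for the example. Flags are not part of `source`, they are available separately as `flags`:

```js
let number = /\d+/gi;

console.log( number.source ); // \d+
console.log( number.flags );  // gi

// Insert the source twice into a bigger pattern; flags must be given again if needed
let range = new RegExp(`${number.source}-${number.source}`);

console.log( "pages 12-34".match(range)[0] ); // 12-34
```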