# Infinite backtracking problem

Some regular expressions look simple, but can take a very long time to execute, and even "hang" the JavaScript engine.

Sooner or later most developers face such behavior. The typical situation: a regular expression works fine sometimes, but for certain strings it "hangs", consuming 100% of CPU.

That may even be a vulnerability. For instance, if JavaScript runs on the server and uses regular expressions to process user data, then such an input may cause denial of service. The author personally saw and reported such vulnerabilities even in well-known and widely used programs.

So the problem is definitely worth dealing with.

## Example

The plan will be like this:

1. First we see the problem and how it may occur.
2. Then we simplify the situation and see why it occurs.
3. Then we fix it.

For instance, let's consider searching for tags in HTML.

We want to find all tags, with or without attributes. We need the regexp to work reliably, because HTML comes from the internet and can be messy.

In particular, we need it to match tags like `<a test="<>" href="#">` -- with `<` and `>` in attributes. That's allowed by the [HTML standard](https://html.spec.whatwg.org/multipage/syntax.html#syntax-attributes).

Now we can see that a simple regexp like `pattern:<[^>]+>` doesn't work, because it stops at the first `>`, and we need to ignore `<>` inside an attribute.

```js run
// the match doesn't reach the end of the tag - wrong!
alert( '<a test="<>" href="#">'.match(/<[^>]+>/) ); // <a test="<>
```

To handle such tags we need a more complex regexp: `pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`:

1. `pattern:<\w+` -- is the tag start,
2. `pattern:(\s*\w+=(\w+|"[^"]*")\s*)*` -- is an arbitrary number of pairs `word=value`, where the value can be either a word `pattern:\w+` or a quoted string `pattern:"[^"]*"`.

That doesn't yet support a few details of HTML grammar, for instance strings in 'single' quotes, but they can be added later, so it's reasonably close to real life. For now we want the regexp to be simple.
Let's try it in action:

```js run
let reg = /<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>/g;

let str = '...<a test="<>" href="#">... <b>...';

alert( str.match(reg) ); // <a test="<>" href="#">, <b>
```

Great, it works! It found both the long tag `match:<a test="<>" href="#">` and the short one `match:<b>`.

Now let's see the problem.

If you run the example below, it may hang the browser (or whatever JavaScript engine runs it):

```js run
let reg = /<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>/g;

// many name=value pairs, but no closing ">"
let str = `<tag a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b
  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b`;

// the search may take a very long time
alert( str.match(reg) );
```

To see why, we can simplify the pattern by dropping the tag name and the quoted values, leaving only the attributes: `pattern:<(\s*\w+=\w+\s*)*>`. The same problem persists.

There's no `>` at the end of the string, so the match is impossible, but the regexp engine does not know about it. The search backtracks, trying different combinations of `pattern:(\s*\w+=\w+\s*)`:

```
(a=b a=b a=b) (a=b)
(a=b a=b) (a=b a=b)
...
```

## How to fix?

The problem: too many variants are tried during backtracking, even when we don't need them.

For instance, in the pattern `pattern:(\d+)*$`, which backtracks in the same way on a long run of digits followed by a non-digit, we (people) can easily see that `pattern:(\d+)` does not need to backtrack.

Making `pattern:\d+` take fewer digits cannot help to find a match. It makes no difference which of these two the engine tries:

```
\d+........
(123456789)z

\d+...\d+....
(1234)(56789)z
```

Let's get back to a more real-life example: `pattern:<(\s*\w+=\w+\s*)*>`. We want it to find pairs `name=value` (as many as it can). There's no need for backtracking here.

In other words, if it found many `name=value` pairs and then can't find `>`, then there's no need to decrease the count of repetitions. Even if we match one pair less, it won't give us the closing `>`.

Many modern regexp engines support so-called "possessive" quantifiers for that. They are like greedy ones, but don't backtrack at all. Pretty simple: they capture whatever they can, and the search continues.

There's also another tool called "atomic groups" that forbids backtracking inside parentheses.

Unfortunately, neither of these features is supported by JavaScript. However, we can get a similar effect using lookahead.
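The lack of support can be checked directly, because an engine that does not know a syntax rejects it when the regexp is compiled. Below is a minimal sketch; `isSupported` is a hypothetical helper introduced only for this illustration, and the possessive `a++` and atomic `(?>a+)` spellings are the ones used by engines such as PCRE and Java.

```js
// Hypothetical helper (not from the article): reports whether the current
// engine can compile the given regexp source.
function isSupported(source) {
  try {
    new RegExp(source);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything other than a syntax error is unexpected
  }
}

console.log( isSupported('a+') );     // ordinary greedy quantifier
console.log( isSupported('a++') );    // possessive quantifier
console.log( isSupported('(?>a+)') ); // atomic group
```

In JavaScript engines available at the time of writing, the first call should print `true` and the other two `false`.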
There's more about the relation between possessive quantifiers and lookahead in the articles [Regex: Emulate Atomic Grouping (and Possessive Quantifiers) with LookAhead](http://instanceof.me/post/52245507631/regex-emulate-atomic-grouping-with-lookahead) and [Mimicking Atomic Groups](http://blog.stevenlevithan.com/archives/mimic-atomic-groups).

The pattern to take as many repetitions as possible without backtracking is: `pattern:(?=(a+))\1`.

In other words, the lookahead `pattern:?=` looks for the maximal count of `pattern:a+` from the current position. And then they are "consumed into the result" by the backreference `pattern:\1`.

There will be no backtracking, because the engine does not backtrack into a lookahead once it has matched. If `pattern:a+` took 5 characters and the further match failed, the engine doesn't go back and retry with 4.

Let's fix the regexp for a tag with attributes from the beginning of the chapter: `pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`. We'll use lookahead to prevent backtracking of `name=value` pairs:

```js run
// regexp to search name=value
let attrReg = /(\s*\w+=(\w+|"[^"]*")\s*)/;

// use it inside the regexp for tag
let reg = new RegExp('<\\w+(?=(' + attrReg.source + '*))\\1>', 'g');

let good = '...<a test="<>" href="#">... <b>...';

// many name=value pairs, but no closing ">"
let bad = `<tag a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b
  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b`;

alert( good.match(reg) ); // <a test="<>" href="#">, <b>
alert( bad.match(reg) ); // null (no results, fast!)
```

Great, it works! We found the long tag `match:<a test="<>" href="#">` and the short one `match:<b>`, and didn't hang the engine.

Please note the `attrReg.source` property. `RegExp` objects provide access to their source string through it. That's convenient when we want to insert one regexp into another.
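The lookahead trick can be wrapped in a small function so that it is easier to reuse. The sketch below is only an illustration: `atomic` is a hypothetical helper, and it uses a named group (`(?<atom>...)` with `\k<atom>`) instead of `\1`, on the assumption that the engine supports named groups, so the backreference does not depend on how many other groups the regexp has.

```js
// Hypothetical helper (not from the article): wraps a pattern source so that
// it is matched without backtracking, via lookahead plus a backreference.
// Group names must be unique, so pass a different name for each use
// within the same regexp.
function atomic(source, name = 'atom') {
  return '(?=(?<' + name + '>' + source + '))\\k<' + name + '>';
}

// The wrapped part never gives characters back, which changes what matches:
let plain = /a+ab/;
let guarded = new RegExp(atomic('a+') + 'ab');

console.log( plain.test('aaab') );   // true: a+ gives one "a" back to match "ab"
console.log( guarded.test('aaab') ); // false: a+ keeps every "a" it took

// An anchored variant of the digits pattern from the text
// (the ^ anchor is added here so that a failed match is reported as false)
let digits = new RegExp('^' + atomic('(\\d+)*') + '$');

console.log( digits.test('12345678901234567890123456789') );  // true
console.log( digits.test('12345678901234567890123456789z') ); // false
```

The second pair of calls should return quickly even for the string that ends with `z`, because after the lookahead has taken all the digits there are no other splits left to try. The first pair is a reminder that the trick is safe only where giving characters back could never help, as with the `name=value` pairs above.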