# Infinite backtracking problem

Some regular expressions look simple, but can take a very long time to execute, and can even "hang" the JavaScript engine.

Sooner or later most developers run into this behavior.

The typical situation: a regular expression works fine for some time, and then starts to "hang" the script and make it consume 100% of CPU.

That may even be a vulnerability, for instance if JavaScript runs on the server and applies regular expressions to user data. There have been many vulnerabilities of that kind, even in widely distributed systems.

So the problem is definitely worth dealing with.

[cut]

## Example

The plan is:

1. First we see how the problem may occur.
2. Then we simplify the situation and see why it occurs.
3. Then we fix it.

For instance, let's consider searching for tags in HTML.

We want to find all tags, with or without attributes. We need the regexp to work reliably, because HTML comes from the internet and can be messy.

In particular, we need it to match tags like `subject:<a test="<>" href="#">` -- with `<` and `>` in attributes. That's allowed by the [HTML standard](https://html.spec.whatwg.org/multipage/syntax.html#syntax-attributes).

A simple regexp like `pattern:<[^>]+>` doesn't work, because it stops at the first `>`, and we need to ignore `<>` inside an attribute.

```js run
// the match doesn't reach the end of the tag - wrong!
alert( '<a test="<>" href="#">'.match(/<[^>]+>/) ); // <a test="<>
```

To handle such tags correctly we need a more complex regexp, `pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`:

1. `pattern:<\w+` -- is the tag start,
2. `pattern:(\s*\w+=(\w+|"[^"]*")\s*)*` -- is an arbitrary number of pairs `word=value`, where the value can be either a word `pattern:\w+` or a quoted string `pattern:"[^"]*"`.

That doesn't yet support all the details of HTML grammar (for instance, strings can be in 'single' quotes), but these can be added later, so it's reasonably close to real life. For now we want the regexp to be simple.

Let's try it in action:

```js run
let reg = /<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>/g;

let str = '...<a test="<>" href="#">... <b>...';

alert( str.match(reg) ); // <a test="<>" href="#">, <b>
```

Great, it works! It found both the long tag `match:<a test="<>" href="#">` and the short one `match:<b>`.

Now let's see the problem.

If you run the example below, it may hang the browser (or another JavaScript engine):

```js run
let reg = /<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>/g;

let str = `<tag a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b
  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b`;

// The search will take a long, long time
alert( str.match(reg) );
```

The problem persists if we simplify the regexp so that it looks only for attributes:

```js run
// only search for space-delimited attributes
let reg = /<(\s*\w+=\w+\s*)*>/g;

let str = `<a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b
  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b`;

// The search will take a long, long time
alert( str.match(reg) );
```

There's no `>` at the end of the string, so the match is impossible, but the regexp engine does not know about it. The search backtracks, trying different combinations of `pattern:(\s*\w+=\w+\s*)`:

```
(a=b a=b a=b) (a=b)
(a=b a=b) (a=b a=b)
...
```

## How to fix?

The problem is that backtracking tries too many variants, even when we don't need them.

For instance, in the pattern `pattern:(\d+)*$` we (people) can easily see that `pattern:(\d+)` does not need to backtrack. Decreasing the count of `pattern:\d+` cannot help to find a match; there's no difference between these two:

```
\d+........
(123456789)z

\d+...\d+....
(1234)(56789)z
```

If we get back to the more realistic example `pattern:<(\s*\w+=\w+\s*)*>`, the search algorithm we have in mind is "simply" to find a tag and then the `attribute=value` pairs (as many as there are). No backtracking is needed here.

Modern regular expression engines offer "possessive" quantifiers to solve this problem. They do not backtrack at all.

That is, they are even simpler than "greedy" ones: they take the maximal number of characters and that's it. The search continues further. On a mismatch, nothing is given back.

On one hand, this reduces the number of possible results. On the other hand, in many cases it's obvious that giving characters back (reducing the number of repetitions of the quantifier) cannot produce a result. It only wastes time, which is exactly what causes the problems. Those are precisely the situations described above.
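How much time can such pointless backtracking waste? As a rough illustration for the `pattern:(\d+)*$` example, the sketch below counts the ways a run of digits can be split into consecutive `pattern:\d+` groups, like `(123456789)` or `(1234)(56789)`. The helper `countSplits` is hypothetical and exists only for this illustration; a real engine does not expose such a count and may skip some of these variants.

```js
// Hypothetical helper: counts the ways to split n digits into
// consecutive non-empty groups, e.g. (1234)(56789).
function countSplits(n) {
  if (n === 0) return 1; // nothing left to split: exactly one way
  let total = 0;
  // the first group takes `first` digits, the rest is split recursively
  for (let first = 1; first <= n; first++) {
    total += countSplits(n - first);
  }
  return total;
}

for (let n of [3, 9, 20]) {
  console.log(n, countSplits(n)); // 3 4, then 9 256, then 20 524288
}
```

Every extra digit doubles the number of splits, which is why a string that is only slightly longer can turn a fast search into one that seems to hang.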
There is another tool, "atomic groups", which forbid backtracking inside the parentheses and essentially achieve the same thing as possessive quantifiers.

Unfortunately, JavaScript supports neither of them.

However, a similar effect can be achieved with lookahead. A detailed description of the correspondence, using the syntax of possessive quantifiers and atomic groups, is given in the articles [Regex: Emulate Atomic Grouping (and Possessive Quantifiers) with LookAhead](http://instanceof.me/post/52245507631/regex-emulate-atomic-grouping-with-lookahead) and [Mimicking Atomic Groups](http://blog.stevenlevithan.com/archives/mimic-atomic-groups). Here we stay within JavaScript syntax.

Taking the maximal number of repetitions of `a+` without backtracking looks like this: `pattern:(?=(a+))\1`.

In other words, the lookahead `pattern:?=` finds the maximal number of repetitions `pattern:a+` available from the current position. Then they are "taken into the result" by the backreference `pattern:\1`. The search continues after the found repetitions.

Backtracking is not part of this logic, because a lookahead cannot "backtrack". That is, if the lookahead found 5 repetitions of `pattern:a` and the rest of the search failed, it won't go back to 4 repetitions. Lookahead lacks that ability, and in this case it's exactly the ability we don't want.

Let's fix the regexp for a tag with attributes, `pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`, described at the beginning of the chapter. We use lookahead to forbid backtracking to a smaller number of `attribute=value` pairs:

```js run
// regexp for an attribute=value pair
let attr = /(\s*\w+=(\w+|"[^"]*")\s*)/;

// use it inside the regexp for the tag
let reg = new RegExp('<\\w+(?=(' + attr.source + '*))\\1>', 'g');

let good = '...<a test="<>" href="#">... <b>...';

let bad = `<tag a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b
  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b  a=b`;

alert( good.match(reg) ); // <a test="<>" href="#">, <b>
alert( bad.match(reg) ); // null (no results, fast)
```

Great, it works! It found both the long tag `match:<a test="<>" href="#">` and the lone `match:<b>`.
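To see the "no giving back" behavior of `pattern:(?=(a+))\1` in isolation, here is a small sketch that compares it with a plain greedy `pattern:a+`. The test string and the trailing `ab` / `b` parts are chosen only for this illustration.

```js
const text = "aaab";

// Plain greedy quantifier: `a+` first takes "aaa", then gives one "a" back
// so that the trailing `ab` can match.
console.log(/a+ab/.test(text)); // true

// Lookahead + backreference: "aaa" is taken as a whole and never given back,
// so nothing is left for the trailing `a` and the match fails.
console.log(/(?=(a+))\1ab/.test(text)); // false

// When no give-back is needed, both forms agree.
console.log(/a+b/.test(text)); // true
console.log(/(?=(a+))\1b/.test(text)); // true
```

The second result is the price of the trick: it is only safe where, as with the `attribute=value` pairs, taking fewer repetitions could never lead to a match anyway.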