minor fixes

Ilya Kantor 2020-10-06 00:47:35 +03:00
parent ae06ca62bb
commit f409905f7b
2 changed files with 98 additions and 75 deletions


@ -3,13 +3,15 @@
[recent browser="new"] [recent browser="new"]
The optional chaining `?.` is an error-proof way to access nested object properties, even if an intermediate property doesn't exist. The optional chaining `?.` is a safe way to access nested object properties, even if an intermediate property doesn't exist.
## The problem ## The "non-existing property" problem
If you've just started to read the tutorial and learn JavaScript, maybe the problem hasn't touched you yet, but it's quite common. If you've just started to read the tutorial and learn JavaScript, maybe the problem hasn't touched you yet, but it's quite common.
For example, some of our users have addresses, but few did not provide them. Then we can't safely read `user.address.street`: As an example, let's consider objects for user data. Most of our users enter addresses, but some did not provide them.
In such a case, when we attempt to get `user.address.street`, we get an error:
```js run ```js run
let user = {}; // the user happens to be without address let user = {}; // the user happens to be without address
@ -17,7 +19,7 @@ let user = {}; // the user happens to be without address
alert(user.address.street); // Error! alert(user.address.street); // Error!
``` ```
Or, in the web development, we'd like to get an information about an element on the page, but it may not exist: Another example. In the web development, we may need to get an information about an element on the page, that sometimes doesn't exist:
```js run ```js run
// Error if the result of querySelector(...) is null // Error if the result of querySelector(...) is null
@ -34,7 +36,7 @@ let user = {}; // user has no address
alert( user && user.address && user.address.street ); // undefined (no error) alert( user && user.address && user.address.street ); // undefined (no error)
``` ```
AND'ing the whole path to the property ensures that all components exist, but is cumbersome to write. AND'ing the whole path to the property ensures that all components exist (if not, the evaluation stops), but is cumbersome to write.
## Optional chaining ## Optional chaining
@ -70,7 +72,7 @@ We should use `?.` only where it's ok that something doesn't exist.
For example, if according to our coding logic `user` object must be there, but `address` is optional, then `user.address?.street` would be better. For example, if according to our coding logic `user` object must be there, but `address` is optional, then `user.address?.street` would be better.
So, if `user` happens to be undefined due to a mistake, we'll know about it and fix it. Otherwise, coding errors can be silenced where not appropriate, and become more difficult to debug. So, if `user` happens to be undefined due to a mistake, we'll see a programming error about it and fix it. Otherwise, coding errors can be silenced where not appropriate, and become more difficult to debug.
``` ```
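To make the difference concrete, here is a small sketch. The object shape and the `missingUser` variable are made up for illustration, and it uses `console.log` instead of `alert` so that it also runs outside a browser:

```js
// By our logic `user` must exist, only `address` is optional
let user = { name: "John" };
let street = user.address?.street; // undefined, no error

// Hypothetical bug: the user object was never assigned
let missingUser;

let bugNoticed = false;
try {
  missingUser.address?.street; // plain dot before `address`: throws TypeError
} catch (err) {
  bugNoticed = err instanceof TypeError;
}

// With ?. right after the variable too, the same bug passes silently
let silenced = missingUser?.address?.street;

console.log(street, bugNoticed, silenced); // undefined true undefined
```

The plain dot after `missingUser` is what makes the mistake visible; the extra `?.` in the last line trades that error for a quiet `undefined`.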
````warn header="The variable before `?.` must be declared" ````warn header="The variable before `?.` must be declared"
@ -80,25 +82,27 @@ If there's no variable `user` at all, then `user?.anything` triggers an error:
// ReferenceError: user is not defined // ReferenceError: user is not defined
user?.address; user?.address;
``` ```
There must be `let/const/var user`. The optional chaining works only for declared variables. There must be a declaration (e.g. `let/const/var user`). The optional chaining works only for declared variables.
```` ````
## Short-circuiting ## Short-circuiting
As it was said before, the `?.` immediately stops ("short-circuits") the evaluation if the left part doesn't exist. As it was said before, the `?.` immediately stops ("short-circuits") the evaluation if the left part doesn't exist.
So, if there are any further function calls or side effects, they don't occur: So, if there are any further function calls or side effects, they don't occur.
For instance:
```js run ```js run
let user = null; let user = null;
let x = 0; let x = 0;
user?.sayHi(x++); // nothing happens user?.sayHi(x++); // no "sayHi", so the execution doesn't reach x++
alert(x); // 0, value not incremented alert(x); // 0, value not incremented
``` ```
## Other cases: ?.(), ?.[] ## Other variants: ?.(), ?.[]
The optional chaining `?.` is not an operator, but a special syntax construct, that also works with functions and square brackets. The optional chaining `?.` is not an operator, but a special syntax construct, that also works with functions and square brackets.
@ -121,9 +125,9 @@ user2.admin?.();
*/!* */!*
``` ```
Here, in both lines we first use the dot `.` to get `admin` property, because the user object must exist, so it's safe read from it. Here, in both lines we first use the dot (`user1.admin`) to get `admin` property, because the user object must exist, so it's safe read from it.
Then `?.()` checks the left part: if the admin function exists, then it runs (for `user1`). Otherwise (for `user2`) the evaluation stops without errors. Then `?.()` checks the left part: if the admin function exists, then it runs (that's so for `user1`). Otherwise (for `user2`) the evaluation stops without errors.
The `?.[]` syntax also works, if we'd like to use brackets `[]` to access properties instead of dot `.`. Similar to previous cases, it allows to safely read a property from an object that may not exist. The `?.[]` syntax also works, if we'd like to use brackets `[]` to access properties instead of dot `.`. Similar to previous cases, it allows to safely read a property from an object that may not exist.
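A minimal sketch of the bracket form, assuming a property name stored in a variable (`key`, `user1` and `user2` are illustrative names only):

```js
let key = "firstName"; // hypothetical property name, could be computed at runtime

let user1 = { firstName: "John" };
let user2 = null; // e.g. the user couldn't be loaded

let name1 = user1?.[key]; // reads user1["firstName"]
let name2 = user2?.[key]; // user2 is null, so the evaluation stops

console.log(name1, name2); // John undefined
```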
@ -148,19 +152,23 @@ Also we can use `?.` with `delete`:
delete user?.name; // delete user.name if user exists delete user?.name; // delete user.name if user exists
``` ```
```warn header="We can use `?.` for safe reading and deleting, but not writing" ````warn header="We can use `?.` for safe reading and deleting, but not writing"
The optional chaining `?.` has no use at the left side of an assignment: The optional chaining `?.` has no use at the left side of an assignment.
For example:
```js run ```js run
// the idea of the code below is to write user.name, if user exists let user = null;
user?.name = "John"; // Error, doesn't work user?.name = "John"; // Error, doesn't work
// because it evaluates to undefined = "John" // because it evaluates to undefined = "John"
``` ```
It's just not that smart.
````
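If the goal is to write a property only when the object exists, an explicit check does the job. A minimal sketch, where the helper name `setName` is made up for illustration. The `new Function` call is there only to show that engines reject the `?.` assignment while parsing the code, before any of it runs:

```js
// Hypothetical helper: assign the name only if the object exists
function setName(user, name) {
  if (user) {
    user.name = name;
  }
  return user;
}

let john = setName({}, "John");     // the property is written
let nobody = setName(null, "John"); // nothing is written, no error

// The `?.` form on the left side of an assignment is a syntax error
let rejected = false;
try {
  new Function('let user = null; user?.name = "John";');
} catch (err) {
  rejected = err instanceof SyntaxError;
}

console.log(john.name, nobody, rejected); // John null true
```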
## Summary ## Summary
The `?.` syntax has three forms: The optional chaining `?.` syntax has three forms:
1. `obj?.prop` -- returns `obj.prop` if `obj` exists, otherwise `undefined`. 1. `obj?.prop` -- returns `obj.prop` if `obj` exists, otherwise `undefined`.
2. `obj?.[prop]` -- returns `obj[prop]` if `obj` exists, otherwise `undefined`. 2. `obj?.[prop]` -- returns `obj[prop]` if `obj` exists, otherwise `undefined`.
@ -170,6 +178,4 @@ As we can see, all of them are straightforward and simple to use. The `?.` check
A chain of `?.` allows to safely access nested properties. A chain of `?.` allows to safely access nested properties.
Still, we should apply `?.` carefully, only where it's ok that the left part doesn't to exist. Still, we should apply `?.` carefully, only where it's acceptable that the left part doesn't to exist. So that it won't hide programming errors from us, if they occur.
So that it won't hide programming errors from us, if they occur.


@ -1,20 +1,20 @@
# Catastrophic backtracking # Catastrophic backtracking
Some regular expressions are looking simple, but can execute veeeeeery long time, and even "hang" the JavaScript engine. Some regular expressions are looking simple, but can execute a veeeeeery long time, and even "hang" the JavaScript engine.
Sooner or later most developers occasionally face such behavior, because it's quite easy to create such a regexp. Sooner or later most developers occasionally face such behavior. The typical symptom -- a regular expression works fine sometimes, but for certain strings it "hangs", consuming 100% of CPU.
The typical symptom -- a regular expression works fine sometimes, but for certain strings it "hangs", consuming 100% of CPU.
In such case a web-browser suggests to kill the script and reload the page. Not a good thing for sure. In such case a web-browser suggests to kill the script and reload the page. Not a good thing for sure.
For server-side JavaScript it may become a vulnerability if regular expressions process user data. For server-side JavaScript such a regexp may hang the server process, that's even worse. So we definitely should take a look at it.
## Example ## Example
Let's say we have a string, and we'd like to check if it consists of words `pattern:\w+` with an optional space `pattern:\s?` after each. Let's say we have a string, and we'd like to check if it consists of words `pattern:\w+` with an optional space `pattern:\s?` after each.
We'll use a regexp `pattern:^(\w+\s?)*$`, it specifies 0 or more such words. An obvious way to construct a regexp would be to take a word followed by an optional space `pattern:\w+\s?` and then repeat it with `*`.
That leads us to the regexp `pattern:^(\w+\s?)*$`. It specifies zero or more such words that start at the beginning `pattern:^` and finish at the end `pattern:$` of the line.
In action: In action:
@ -25,9 +25,9 @@ alert( regexp.test("A good string") ); // true
alert( regexp.test("Bad characters: $@#") ); // false alert( regexp.test("Bad characters: $@#") ); // false
``` ```
It seems to work. The result is correct. Although, on certain strings it takes a lot of time. So long that JavaScript engine "hangs" with 100% CPU consumption. The regexp seems to work. The result is correct. Although, on certain strings it takes a lot of time. So long that JavaScript engine "hangs" with 100% CPU consumption.
If you run the example below, you probably won't see anything, as JavaScript will just "hang". A web-browser will stop reacting on events, the UI will stop working. After some time it will suggest to reloaad the page. So be careful with this: If you run the example below, you probably won't see anything, as JavaScript will just "hang". A web-browser will stop reacting on events, the UI will stop working (most browsers allow only scrolling). After some time it will suggest to reload the page. So be careful with this:
```js run ```js run
let regexp = /^(\w+\s?)*$/; let regexp = /^(\w+\s?)*$/;
@ -37,24 +37,22 @@ let str = "An input string that takes a long time or even makes this regexp to h
alert( regexp.test(str) ); alert( regexp.test(str) );
``` ```
Some regular expression engines can handle such search, but most of them can't. To be fair, let's note that some regular expression engines can handle such a search effectively. But most of them can't. Browser engines usually hang.
## Simplified example ## Simplified example
What's the matter? Why the regular expression "hangs"? What's the matter? Why the regular expression hangs?
To understand that, let's simplify the example: remove spaces `pattern:\s?`. Then it becomes `pattern:^(\w+)*$`. To understand that, let's simplify the example: remove spaces `pattern:\s?`. Then it becomes `pattern:^(\w+)*$`.
And, to make things more obvious, let's replace `pattern:\w` with `pattern:\d`. The resulting regular expression still hangs, for instance: And, to make things more obvious, let's replace `pattern:\w` with `pattern:\d`. The resulting regular expression still hangs, for instance:
<!-- let str = `AnInputStringThatMakesItHang!`; -->
```js run ```js run
let regexp = /^(\d+)*$/; let regexp = /^(\d+)*$/;
let str = "012345678901234567890123456789!"; let str = "012345678901234567890123456789z";
// will take a very long time // will take a very long time (careful!)
alert( regexp.test(str) ); alert( regexp.test(str) );
``` ```
@ -62,45 +60,49 @@ So what's wrong with the regexp?
First, one may notice that the regexp `pattern:(\d+)*` is a little bit strange. The quantifier `pattern:*` looks extraneous. If we want a number, we can use `pattern:\d+`. First, one may notice that the regexp `pattern:(\d+)*` is a little bit strange. The quantifier `pattern:*` looks extraneous. If we want a number, we can use `pattern:\d+`.
Indeed, the regexp is artificial. But the reason why it is slow is the same as those we saw above. So let's understand it, and then the previous example will become obvious. Indeed, the regexp is artificial, we got it by simplifying the previous example. But the reason why it is slow is the same. So let's understand it, and then the previous example will become obvious.
What happens during the search of `pattern:^(\d+)*$` in the line `subject:123456789!` (shortened a bit for clarity), why does it take so long? What happens during the search of `pattern:^(\d+)*$` in the line `subject:123456789z` (shortened a bit for clarity, please note a non-digit character `subject:z` at the end, it's important), why does it take so long?
1. First, the regexp engine tries to find a number `pattern:\d+`. The plus `pattern:+` is greedy by default, so it consumes all digits: Here's what the regexp engine does:
1. First, the regexp engine tries to find the content of the parentheses: the number `pattern:\d+`. The plus `pattern:+` is greedy by default, so it consumes all digits:
``` ```
\d+....... \d+.......
(123456789)z (123456789)z
``` ```
Then it tries to apply the star quantifier, but there are no more digits, so it the star doesn't give anything. After all digits are consumed, `pattern:\d+` is considered found (as `match:123456789`).
The next in the pattern is the string end `pattern:$`, but in the text we have `subject:!`, so there's no match: Then the star quantifier `pattern:(\d+)*` applies. But there are no more digits in the text, so the star doesn't give anything.
The next character in the pattern is the string end `pattern:$`. But in the text we have `subject:z` instead, so there's no match:
``` ```
X X
\d+........$ \d+........$
(123456789)! (123456789)z
``` ```
2. As there's no match, the greedy quantifier `pattern:+` decreases the count of repetitions, backtracks one character back. 2. As there's no match, the greedy quantifier `pattern:+` decreases the count of repetitions, backtracks one character back.
Now `pattern:\d+` takes all digits except the last one: Now `pattern:\d+` takes all digits except the last one (`match:12345678`):
``` ```
\d+....... \d+.......
(12345678)9! (12345678)9z
``` ```
3. Then the engine tries to continue the search from the new position (`9`). 3. Then the engine tries to continue the search from the next position (right after `match:12345678`).
The star `pattern:(\d+)*` can be applied -- it gives the number `match:9`: The star `pattern:(\d+)*` can be applied -- it gives one more match of `pattern:\d+`, the number `match:9`:
``` ```
\d+.......\d+ \d+.......\d+
(12345678)(9)! (12345678)(9)z
``` ```
The engine tries to match `pattern:$` again, but fails, because meets `subject:!`: The engine tries to match `pattern:$` again, but fails, because it meets `subject:z` instead:
``` ```
X X
@ -118,7 +120,7 @@ What happens during the search of `pattern:^(\d+)*$` in the line `subject:123456
``` ```
X X
\d+......\d+ \d+......\d+
(1234567)(89)! (1234567)(89)z
``` ```
The first number has 7 digits, and then two numbers of 1 digit each: The first number has 7 digits, and then two numbers of 1 digit each:
@ -126,7 +128,7 @@ What happens during the search of `pattern:^(\d+)*$` in the line `subject:123456
``` ```
X X
\d+......\d+\d+ \d+......\d+\d+
(1234567)(8)(9)! (1234567)(8)(9)z
``` ```
The first number has 6 digits, and then a number of 3 digits: The first number has 6 digits, and then a number of 3 digits:
@ -134,7 +136,7 @@ What happens during the search of `pattern:^(\d+)*$` in the line `subject:123456
``` ```
X X
\d+.......\d+ \d+.......\d+
(123456)(789)! (123456)(789)z
``` ```
The first number has 6 digits, and then 2 numbers: The first number has 6 digits, and then 2 numbers:
@ -142,23 +144,19 @@ What happens during the search of `pattern:^(\d+)*$` in the line `subject:123456
``` ```
X X
\d+.....\d+ \d+ \d+.....\d+ \d+
(123456)(78)(9)! (123456)(78)(9)z
``` ```
...And so on. ...And so on.
There are many ways to split a set of digits `123456789` into numbers. To be precise, there are <code>2<sup>n</sup>-1</code>, where `n` is the length of the set. There are many ways to split a sequence of digits `123456789` into numbers. To be precise, there are <code>2<sup>n</sup>-1</code>, where `n` is the length of the sequence.
For `n=20` there are about 1 million combinations, for `n=30` - a thousand times more. Trying each of them is exactly the reason why the search takes so long. - For `123456789` we have `n=9`, that gives 511 combinations.
- For a longer sequence with `n=20` there are about one million (1048575) combinations.
- For `n=30` - a thousand times more (1073741823 combinations).
What to do? Trying each of them is exactly the reason why the search takes so long.
Should we turn on the lazy mode?
Unfortunately, that won't help: if we replace `pattern:\d+` with `pattern:\d+?`, the regexp will still hang. The order of combinations will change, but not their total count.
Some regular expression engines have tricky tests and finite automations that allow to avoid going through all combinations or make it much faster, but not all engines, and not in all cases.
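The <code>2<sup>n</sup>-1</code> figure can be checked with a short count. The sketch below is one way to read that number (an interpretation, not something the engine literally computes): it counts the ways `pattern:(\d+)*` can consume a non-empty run of digits from the start of the string, each of which ends in a failed `pattern:$`. The function name `countAttempts` is made up for this illustration:

```js
// Count the ways (\d+)* can consume the first k digits, for every k from 1 to n
function countAttempts(n) {
  // ways[k] = number of ways to split exactly k digits into consecutive numbers
  let ways = [1]; // one way to consume nothing: zero repetitions
  let total = 0;

  for (let k = 1; k <= n; k++) {
    ways[k] = 0;
    // the last number takes digits j+1..k, the rest is any split of the first j digits
    for (let j = 0; j < k; j++) {
      ways[k] += ways[j];
    }
    total += ways[k];
  }

  return total;
}

console.log(countAttempts(9));  // 511
console.log(countAttempts(20)); // 1048575
console.log(countAttempts(30)); // 1073741823
```

Each extra digit doubles the count (plus one), which is why a string only slightly longer takes so much more time.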
## Back to words and strings ## Back to words and strings
@ -176,7 +174,15 @@ The reason is that a word can be represented as one `pattern:\w+` or many:
For a human, it's obvious that there may be no match, because the string ends with an exclamation sign `!`, but the regular expression expects a wordly character `pattern:\w` or a space `pattern:\s` at the end. But the engine doesn't know that. For a human, it's obvious that there may be no match, because the string ends with an exclamation sign `!`, but the regular expression expects a wordly character `pattern:\w` or a space `pattern:\s` at the end. But the engine doesn't know that.
It tries all combinations of how the regexp `pattern:(\w+\s?)*` can "consume" the string, including variants with spaces `pattern:(\w+\s)*` and without them `pattern:(\w+)*` (because spaces `pattern:\s?` are optional). As there are many such combinations, the search takes a lot of time. It tries all combinations of how the regexp `pattern:(\w+\s?)*` can "consume" the string, including variants with spaces `pattern:(\w+\s)*` and without them `pattern:(\w+)*` (because spaces `pattern:\s?` are optional). As there are many such combinations (we've seen it with digits), the search takes a lot of time.
What to do?
Should we turn on the lazy mode?
Unfortunately, that won't help: if we replace `pattern:\w+` with `pattern:\w+?`, the regexp will still hang. The order of combinations will change, but not their total count.
Some regular expression engines have tricky tests and finite automata that allow them to avoid going through all combinations or to go through them much faster, but most engines don't, and it doesn't always help.
## How to fix? ## How to fix?
@ -184,7 +190,7 @@ There are two main approaches to fixing the problem.
The first is to lower the number of possible combinations. The first is to lower the number of possible combinations.
Let's rewrite the regular expression as `pattern:^(\w+\s)*\w*` - we'll look for any number of words followed by a space `pattern:(\w+\s)*`, and then (optionally) a word `pattern:\w*`. Let's make the space non-optional by rewriting the regular expression as `pattern:^(\w+\s)*\w*$` - we'll look for any number of words followed by a space `pattern:(\w+\s)*`, and then (optionally) a final word `pattern:\w*`.
This regexp is equivalent to the previous one (matches the same) and works well: This regexp is equivalent to the previous one (matches the same) and works well:
@ -197,26 +203,30 @@ alert( regexp.test(str) ); // false
Why did the problem disappear? Why did the problem disappear?
Now the star `pattern:*` goes after `pattern:\w+\s` instead of `pattern:\w+\s?`. It became impossible to represent one word of the string with multiple successive `pattern:\w+`. The time needed to try such combinations is now saved. That's because now the space is mandatory.
For example, the previous pattern `pattern:(\w+\s?)*` could match the word `subject:string` as two `pattern:\w+`: The previous regexp, if we omit the space, becomes `pattern:(\w+)*`, leading to many combinations of `\w+` within a single word
```js run So `subject:input` could be matched as two repetitions of `pattern:\w+`, like this:
```
\w+ \w+ \w+ \w+
string (inp)(ut)
``` ```
The previous pattern, due to the optional `pattern:\s` allowed variants `pattern:\w+`, `pattern:\w+\s`, `pattern:\w+\w+` and so on. The new pattern is different: `pattern:(\w+\s)*` specifies repetitions of words followed by a space! The `subject:input` string can't be matched as two repetitions of `pattern:\w+\s`, because the space is mandatory.
With the rewritten pattern `pattern:(\w+\s)*`, that's impossible: there may be `pattern:\w+\s` or `pattern:\w+\s\w+\s`, but not `pattern:\w+\w+`. So the overall combinations count is greatly decreased. The time needed to try a lot of (actually most of) combinations is now saved.
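The claim that both regexps match the same strings can be spot-checked on a few short inputs. The sample strings below are made up for the check, and a handful of samples is of course not a proof:

```js
let original = /^(\w+\s?)*$/;
let rewritten = /^(\w+\s)*\w*$/;

// Short strings only, so that the original regexp doesn't hang
let samples = [
  "",
  "A good string",
  "A good string ",
  "two  spaces",
  "Bad characters: $@#",
  " leading space"
];

// Keep the strings on which the two regexps disagree
let mismatches = samples.filter(str => original.test(str) !== rewritten.test(str));

console.log(mismatches.length); // 0
```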
## Preventing backtracking ## Preventing backtracking
It's not always convenient to rewrite a regexp. And it's not always obvious how to do it. It's not always convenient to rewrite a regexp though. In the example above it was easy, but it's not always obvious how to do it.
The alternative approach is to forbid backtracking for the quantifier. Besides, a rewritten regexp is usually more complex, and that's not good. Regexps are complex enough without extra efforts.
The regular expressions engine tries many combinations that are obviously wrong for a human. Luckily, there's an alternative approach. We can forbid backtracking for the quantifier.
The root of the problem is that the regexp engine tries many combinations that are obviously wrong for a human.
E.g. in the regexp `pattern:(\d+)*$` it's obvious for a human, that `pattern:+` shouldn't backtrack. If we replace one `pattern:\d+` with two separate `pattern:\d+\d+`, nothing changes: E.g. in the regexp `pattern:(\d+)*$` it's obvious for a human, that `pattern:+` shouldn't backtrack. If we replace one `pattern:\d+` with two separate `pattern:\d+\d+`, nothing changes:
@ -230,19 +240,26 @@ E.g. in the regexp `pattern:(\d+)*$` it's obvious for a human, that `pattern:+`
And in the original example `pattern:^(\w+\s?)*$` we may want to forbid backtracking in `pattern:\w+`. That is: `pattern:\w+` should match a whole word, with the maximal possible length. There's no need to lower the repetitions count in `pattern:\w+`, try to split it into two words `pattern:\w+\w+` and so on. And in the original example `pattern:^(\w+\s?)*$` we may want to forbid backtracking in `pattern:\w+`. That is: `pattern:\w+` should match a whole word, with the maximal possible length. There's no need to lower the repetitions count in `pattern:\w+`, try to split it into two words `pattern:\w+\w+` and so on.
Modern regular expression engines support possessive quantifiers for that. They are like greedy ones, but don't backtrack (so they are actually simpler than regular quantifiers). Modern regular expression engines support possessive quantifiers for that. Regular quantifiers become possessive if we add `pattern:+` after them. That is, we use `pattern:\d++` instead of `pattern:\d+` to stop `pattern:+` from backtracking.
Possessive quantifiers are in fact simpler than "regular" ones. They just match as many as they can, without any backtracking. The search process without backtracking is simpler.
There are also so-called "atomic capturing groups" - a way to disable backtracking inside parentheses. There are also so-called "atomic capturing groups" - a way to disable backtracking inside parentheses.
Unfortunately, in JavaScript they are not supported. But there's another way. ...But the bad news is that, unfortunately, in JavaScript they are not supported.
We can emulate them though using a "lookahead transform".
### Lookahead to the rescue! ### Lookahead to the rescue!
We can prevent backtracking using lookahead. So we've come to real advanced topics. We'd like a quantifier, such as `pattern:+` not to backtrack, because sometimes backtracking makes no sense.
The pattern to take as much repetitions of `pattern:\w` as possible without backtracking is: `pattern:(?=(\w+))\1`. The pattern to take as much repetitions of `pattern:\w` as possible without backtracking is: `pattern:(?=(\w+))\1`. Of course, we could take another pattern instead of `pattern:\w`.
That may seem odd, but it's actually a very simple transform.
Let's decipher it: Let's decipher it:
- Lookahead `pattern:?=` looks forward for the longest word `pattern:\w+` starting at the current position. - Lookahead `pattern:?=` looks forward for the longest word `pattern:\w+` starting at the current position.
- The contents of parentheses with `pattern:?=...` isn't memorized by the engine, so wrap `pattern:\w+` into parentheses. Then the engine will memorize their contents - The contents of parentheses with `pattern:?=...` isn't memorized by the engine, so wrap `pattern:\w+` into parentheses. Then the engine will memorize their contents
- ...And allow us to reference it in the pattern as `pattern:\1`. - ...And allow us to reference it in the pattern as `pattern:\1`.
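
Putting the pieces together, a minimal sketch. The sample strings are made up for illustration. In the second regexp the outer parentheses take number 1, so the word captured inside the lookahead is referenced as `pattern:\2`:

```js
// A regular quantifier backtracks: \w+ gives "Script" back so the rest can match
let withBacktracking = "JavaScript".match(/\w+Script/);

// The lookahead version takes the whole word and never gives characters back,
// so there's nothing left for "Script"
let withoutBacktracking = "JavaScript".match(/(?=(\w+))\1Script/);

console.log(withBacktracking[0]); // JavaScript
console.log(withoutBacktracking); // null

// The same trick applied to the words regexp
let regexp = /^((?=(\w+))\2\s?)*$/;

let good = regexp.test("A good string");
let bad = regexp.test("A long input string with many words that ends badly!");

console.log(good, bad); // true false
```

The last test returns quickly, because each word is taken as a whole and the engine has no splits inside a word left to try.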