Ilya Kantor 2019-09-05 14:57:06 +03:00
parent fc0b18538d
commit 20547570ff
12 changed files with 376 additions and 186 deletions


@ -41,7 +41,7 @@ Most used are:
: A digit: a character from `0` to `9`.
`pattern:\s` ("s" is from "space")
: A space symbol: includes spaces, tabs `\t`, newlines `\n` and a few other rare characters: `\v`, `\f` and `\r`.
: A space symbol: includes spaces, tabs `\t`, newlines `\n` and a few other rare characters, such as `\v`, `\f` and `\r`.
`pattern:\w` ("w" is from "word")
: A "wordly" character: either a letter of the Latin alphabet or a digit or an underscore `_`. Non-Latin letters (like Cyrillic or Hindi) do not belong to `pattern:\w`.
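A minimal sketch of that limitation, using a made-up sample string (the variable names are illustrative, and `console.log` stands in for `alert` so the snippet also runs outside a browser):

```js
// Hypothetical sample: Latin letters, Cyrillic letters, an underscore and digits
const sample = "Hi Привет_12";

// \w matches only Latin letters, digits and the underscore
const wordly = sample.match(/\w/g);
console.log(wordly); // H,i,_,1,2

// A Cyrillic letter alone is not a \w character
const cyrillicIsWordly = /\w/.test("я");
console.log(cyrillicIsWordly); // false
```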


@ -22,7 +22,7 @@ So the example below gives no matches:
alert( "Voila".match(/V[oi]la/) ); // null, no matches
```
The pattern assumes:
The pattern searches for:
- `pattern:V`,
- then *one* of the letters `pattern:[oi]`,
@ -42,23 +42,56 @@ In the example below we're searching for `"x"` followed by two digits or letters
alert( "Exception 0xAF".match(/x[0-9A-F][0-9A-F]/g) ); // xAF
```
Please note that in the word `subject:Exception` there's a substring `subject:xce`. It didn't match the pattern, because the letters are lowercase, while in the set `pattern:[0-9A-F]` they are uppercase.
Here `pattern:[0-9A-F]` has two ranges: it searches for a character that is either a digit from `0` to `9` or a letter from `A` to `F`.
If we want to find it too, then we can add a range `a-f`: `pattern:[0-9A-Fa-f]`. The `pattern:i` flag would allow lowercase too.
If we'd like to look for lowercase letters as well, we can add the range `a-f`: `pattern:[0-9A-Fa-f]`. Or add the flag `pattern:i`.
**Character classes are shorthands for certain character sets.**
We can also use character classes inside `[…]`.
For instance, if we'd like to look for a wordly character `pattern:\w` or a hyphen `pattern:-`, then the set is `pattern:[\w-]`.
Combining multiple classes is also possible, e.g. `pattern:[\s\d]` means "a space character or a digit".
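As a small sketch of both ideas, with made-up sample strings and `console.log` in place of `alert`:

```js
// [\w-] : a wordly character or a hyphen, so hyphenated words stay whole
const words = "the twenty-third day".match(/[\w-]+/g);
console.log(words); // the,twenty-third,day

// [\s\d] : a space character or a digit
const spacesAndDigits = "a 1\tb2".match(/[\s\d]/g);
console.log(spacesAndDigits.length); // 4: a space, "1", a tab, "2"
```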
```smart header="Character classes are shorthands for certain character sets"
For instance:
- **\d** -- is the same as `pattern:[0-9]`,
- **\w** -- is the same as `pattern:[a-zA-Z0-9_]`,
- **\s** -- is the same as `pattern:[\t\n\v\f\r ]` plus a few other Unicode space characters.
- **\s** -- is the same as `pattern:[\t\n\v\f\r ]`, plus a few other rare Unicode space characters.
```
We can use character classes inside `[…]` as well.
### Example: multi-language \w
For instance, we want to match all wordly characters or a dash, for words like "twenty-third". We can't do it with `pattern:\w+`, because `pattern:\w` class does not include a dash. But we can use `pattern:[\w-]`.
As the character class `pattern:\w` is a shorthand for `pattern:[a-zA-Z0-9_]`, it can't find Chinese characters, Cyrillic letters, etc.
We also can use several classes, for example `pattern:[\s\S]` matches spaces or non-spaces -- any character. That's wider than a dot `"."`, because the dot matches any character except a newline (unless `pattern:s` flag is set).
We can write a more universal pattern that looks for wordly characters in any language. That's easy with Unicode properties: `pattern:[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]`.
Let's decipher it. Similar to `pattern:\w`, we're making a set of our own that includes characters with the following Unicode properties:
- `Alphabetic` (`Alpha`) - for letters,
- `Mark` (`M`) - for accents,
- `Decimal_Number` (`Nd`) - for digits,
- `Connector_Punctuation` (`Pc`) - for the underscore `'_'` and similar characters,
- `Join_Control` (`Join_C`) - two special codes `200c` and `200d`, used in ligatures, e.g. in Arabic.
An example of use:
```js run
let regexp = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]/gu;
let str = `Hi 你好 12`;
// finds all letters and digits:
alert( str.match(regexp) ); // H,i,你,好,1,2
```
Of course, we can edit this pattern: add Unicode properties or remove them. Unicode properties are covered in more detail in the article <info:regexp-unicode>.
```warn header="Unicode properties aren't supported in Edge and Firefox"
Unicode properties `pattern:\p{…}` are not yet implemented in Edge and Firefox. If we really need them, we can use the library [XRegExp](http://xregexp.com/).
Or just use ranges of characters in a language that interests us, e.g. `pattern:[а-я]` for Cyrillic letters.
```
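A rough sketch of the range-based fallback for Cyrillic, on a made-up string. One assumption goes beyond the text above: the letters `Ё` and `ё` lie outside the ranges `А-Я` and `а-я` in Unicode, so they are listed explicitly here.

```js
// Ranges for uppercase and lowercase Cyrillic letters;
// Ё/ё are added by hand because they sit outside А-Я and а-я
const cyrillicWords = "Привет, мир! Hello".match(/[А-Яа-яЁё]+/g);
console.log(cyrillicWords); // Привет,мир
```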
## Excluding ranges
@ -78,22 +111,20 @@ The example below looks for any characters except letters, digits and spaces:
alert( "alice15@gmail.com".match(/[^\d\sA-Z]/gi) ); // @ and .
```
## No escaping in […]
## Escaping in […]
Usually when we want to find exactly the dot character, we need to escape it like `pattern:\.`. And if we need a backslash, then we use `pattern:\\`.
Usually when we want to find exactly a special character, we need to escape it like `pattern:\.`. And if we need a backslash, then we use `pattern:\\`, and so on.
In square brackets the vast majority of special characters can be used without escaping:
In square brackets we can use the vast majority of special characters without escaping:
- A dot `pattern:'.'`.
- A plus `pattern:'+'`.
- Parentheses `pattern:'( )'`.
- Dash `pattern:'-'` in the beginning or the end (where it does not define a range).
- A caret `pattern:'^'` if not in the beginning (where it means exclusion).
- And the opening square bracket `pattern:'['`.
- Symbols `pattern:. + ( )` never need escaping.
- A hyphen `pattern:-` is not escaped in the beginning or the end (where it does not define a range).
- A caret `pattern:^` is only escaped in the beginning (where it means exclusion).
- The closing square bracket `pattern:]` is always escaped (if we need to look for that symbol).
In other words, all special characters are allowed except where they mean something for square brackets.
In other words, all special characters are allowed without escaping, except when they mean something for square brackets.
A dot `"."` inside square brackets means just a dot. The pattern `pattern:[.,]` would look for one of the characters: either a dot or a comma.
A dot `.` inside square brackets means just a dot. The pattern `pattern:[.,]` would look for one of the characters: either a dot or a comma.
In the example below the regexp `pattern:[-().^+]` looks for one of the characters `-().^+`:
@ -112,3 +143,55 @@ let reg = /[\-\(\)\.\^\+]/g;
alert( "1 + 2 - 3".match(reg) ); // also works: +, -
```
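The remaining rules can be sketched the same way. In this illustrative set (the sample string is made up) the closing bracket is escaped, the caret is not at the start and the hyphen is at the end, so all three are matched literally:

```js
// \] is an escaped closing bracket; ^ is not first, - is last: both are literal
const specials = "a]b^c-d".match(/[\]^-]/g);
console.log(specials); // ],^,-
```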
## Ranges and flag "u"
If there are surrogate pairs in the set, flag `pattern:u` is required for them to work correctly.
For instance, let's look for `pattern:[𝒳𝒴]` in the string `subject:𝒳`:
```js run
alert( '𝒳'.match(/[𝒳𝒴]/) ); // shows a strange character, like [?]
// (the search was performed incorrectly, half-character returned)
```
The result is incorrect, because by default regular expressions "don't know" about surrogate pairs.
The regular expression engine thinks that `[𝒳𝒴]` is not two, but four characters:
1. left half of `𝒳` `(1)`,
2. right half of `𝒳` `(2)`,
3. left half of `𝒴` `(3)`,
4. right half of `𝒴` `(4)`.
We can see their codes like this:
```js run
for(let i=0; i<'𝒳𝒴'.length; i++) {
alert('𝒳𝒴'.charCodeAt(i)); // 55349, 56499, 55349, 56500
};
```
So, the example above finds and shows the left half of `𝒳`.
If we add flag `pattern:u`, then the behavior will be correct:
```js run
alert( '𝒳'.match(/[𝒳𝒴]/u) ); // 𝒳
```
A similar situation occurs when looking for a range, such as `[𝒳-𝒴]`.
If we forget to add flag `pattern:u`, there will be an error:
```js run
'𝒳'.match(/[𝒳-𝒴]/); // Error: Invalid regular expression
```
The reason is that without flag `pattern:u` surrogate pairs are perceived as two characters, so `[𝒳-𝒴]` is interpreted as `[<55349><56499>-<55349><56500>]` (every surrogate pair is replaced with its codes). Now it's easy to see that the range `56499-55349` is invalid: its starting code `56499` is greater than the end `55349`. That's the formal reason for the error.
With the flag `pattern:u` the pattern works correctly:
```js run
// look for characters from 𝒳 to 𝒵
alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴
```
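Another way to observe the same effect, as a small sketch with `console.log` instead of `alert`: without the flag `pattern:u` a dot matches one half of a surrogate pair, and with it the dot matches the whole character.

```js
const symbol = "𝒳";

// The string holds two 16-bit code units, but a single code point
console.log(symbol.length);      // 2
console.log([...symbol].length); // 1

// "exactly one character" fails without flag u and succeeds with it
const oneCharWithoutU = /^.$/.test(symbol);
const oneCharWithU = /^.$/u.test(symbol);
console.log(oneCharWithoutU, oneCharWithU); // false true
```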


@ -2,7 +2,7 @@
Let's say we have a string like `+7(903)-123-45-67` and want to find all numbers in it. But unlike before, we are interested not in single digits, but full numbers: `7, 903, 123, 45, 67`.
A number is a sequence of 1 or more digits `pattern:\d`. To mark how many we need, we need to append a *quantifier*.
A number is a sequence of 1 or more digits `pattern:\d`. To mark how many we need, we can append a *quantifier*.
## Quantity {n}
@ -12,7 +12,7 @@ A quantifier is appended to a character (or a character class, or a `[...]` set
It has a few advanced forms, let's see examples:
The exact count: `{5}`
The exact count: `pattern:{5}`
: `pattern:\d{5}` denotes exactly 5 digits, the same as `pattern:\d\d\d\d\d`.
The example below looks for a 5-digit number:
@ -23,7 +23,7 @@ The exact count: `{5}`
We can add `\b` to exclude longer numbers: `pattern:\b\d{5}\b`.
The range: `{3,5}`, match 3-5 times
The range: `pattern:{3,5}`, match 3-5 times
: To find numbers from 3 to 5 digits we can put the limits into curly braces: `pattern:\d{3,5}`
```js run
@ -54,8 +54,8 @@ alert(numbers); // 7,903,123,45,67
There are shorthands for most used quantifiers:
`+`
: Means "one or more", the same as `{1,}`.
`pattern:+`
: Means "one or more", the same as `pattern:{1,}`.
For instance, `pattern:\d+` looks for numbers:
@ -65,8 +65,8 @@ There are shorthands for most used quantifiers:
alert( str.match(/\d+/g) ); // 7,903,123,45,67
```
`?`
: Means "zero or one", the same as `{0,1}`. In other words, it makes the symbol optional.
`pattern:?`
: Means "zero or one", the same as `pattern:{0,1}`. In other words, it makes the symbol optional.
For instance, the pattern `pattern:ou?r` looks for `match:o` followed by zero or one `match:u`, and then `match:r`.
@ -78,16 +78,16 @@ There are shorthands for most used quantifiers:
alert( str.match(/colou?r/g) ); // color, colour
```
`*`
: Means "zero or more", the same as `{0,}`. That is, the character may repeat any number of times or be absent.
`pattern:*`
: Means "zero or more", the same as `pattern:{0,}`. That is, the character may repeat any number of times or be absent.
For example, `pattern:\d0*` looks for a digit followed by any number of zeroes:
For example, `pattern:\d0*` looks for a digit followed by any number of zeroes (may be many or none):
```js run
alert( "100 10 1".match(/\d0*/g) ); // 100, 10, 1
```
Compare it with `'+'` (one or more):
Compare it with `pattern:+` (one or more):
```js run
alert( "100 10 1".match(/\d0+/g) ); // 100, 10
@ -98,43 +98,45 @@ There are shorthands for most used quantifiers:
Quantifiers are used very often. They serve as the main "building block" of complex regular expressions, so let's see more examples.
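Before the longer examples, here is a quick sketch of how a shorthand relates to its curly-brace form, using the phone string from the start of the chapter (the variable names are illustrative):

```js
const phone = "+7(903)-123-45-67";

// + is the same as {1,}
const withPlus = phone.match(/\d+/g);
const withBraces = phone.match(/\d{1,}/g);
console.log(withPlus);   // 7,903,123,45,67
console.log(withBraces); // 7,903,123,45,67

// {2,3} requires at least two digits, so the lone 7 is skipped
const twoOrThree = phone.match(/\d{2,3}/g);
console.log(twoOrThree); // 903,123,45,67
```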
Regexp "decimal fraction" (a number with a floating point): `pattern:\d+\.\d+`
: In action:
```js run
alert( "0 1 12.345 7890".match(/\d+\.\d+/g) ); // 12.345
```
**Regexp for decimal fractions (a number with a floating point): `pattern:\d+\.\d+`**
Regexp "open HTML-tag without attributes", like `<span>` or `<p>`: `pattern:/<[a-z]+>/i`
: In action:
In action:
```js run
alert( "0 1 12.345 7890".match(/\d+\.\d+/g) ); // 12.345
```
**Regexp for an "opening HTML-tag without attributes", such as `<span>` or `<p>`.**
1. The simplest one: `pattern:/<[a-z]+>/i`
```js run
alert( "<body> ... </body>".match(/<[a-z]+>/gi) ); // <body>
```
We look for character `pattern:'<'` followed by one or more Latin letters, and then `pattern:'>'`.
The regexp looks for character `pattern:'<'` followed by one or more Latin letters, and then `pattern:'>'`.
Regexp "open HTML-tag without attributes" (improved): `pattern:/<[a-z][a-z0-9]*>/i`
: Better regexp: according to the standard, HTML tag name may have a digit at any position except the first one, like `<h1>`.
2. Improved: `pattern:/<[a-z][a-z0-9]*>/i`
According to the standard, HTML tag name may have a digit at any position except the first one, like `<h1>`.
```js run
alert( "<h1>Hi!</h1>".match(/<[a-z][a-z0-9]*>/gi) ); // <h1>
```
Regexp "opening or closing HTML-tag without attributes": `pattern:/<\/?[a-z][a-z0-9]*>/i`
: We added an optional slash `pattern:/?` before the tag. We had to escape it with a backslash, otherwise JavaScript would think it is the pattern end.
**Regexp "opening or closing HTML-tag without attributes": `pattern:/<\/?[a-z][a-z0-9]*>/i`**
```js run
alert( "<h1>Hi!</h1>".match(/<\/?[a-z][a-z0-9]*>/gi) ); // <h1>, </h1>
```
We added an optional slash `pattern:/?` near the beginning of the pattern. We had to escape it with a backslash, otherwise JavaScript would think it is the pattern end.
```js run
alert( "<h1>Hi!</h1>".match(/<\/?[a-z][a-z0-9]*>/gi) ); // <h1>, </h1>
```
```smart header="To make a regexp more precise, we often need to make it more complex"
We can see one common rule in these examples: the more precise the regular expression is, the longer and more complex it is.
For instance, for HTML tags we could use a simpler regexp: `pattern:<\w+>`.
For instance, for HTML tags we could use a simpler regexp: `pattern:<\w+>`. But as HTML has stricter restrictions for a tag name, `pattern:<[a-z][a-z0-9]*>` is more reliable.
...But because `pattern:\w` means any Latin letter or a digit or `'_'`, the regexp also matches non-tags, for instance `match:<_>`. So it's much simpler than `pattern:<[a-z][a-z0-9]*>`, but less reliable.
Can we use `pattern:<\w+>` or do we need `pattern:<[a-z][a-z0-9]*>`?
Are we ok with `pattern:<\w+>` or do we need `pattern:<[a-z][a-z0-9]*>`?
In real life both variants are acceptable. It depends on how tolerant we can be to "extra" matches and whether it's difficult or not to filter them out by other means.
In real life both variants are acceptable. It depends on how tolerant we can be to "extra" matches and whether it's difficult or not to remove them from the result by other means.
```
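The trade-off can be seen on a made-up string that contains one non-tag, `<_>`. This is only a sketch, with `console.log` in place of `alert`:

```js
const html = "<_> <h1> <p>";

// The simpler pattern also accepts <_>, which is not a valid tag
const loose = html.match(/<\w+>/g);
console.log(loose); // <_>,<h1>,<p>

// The stricter pattern requires a letter first, then letters or digits
const strict = html.match(/<[a-z][a-z0-9]*>/gi);
console.log(strict); // <h1>,<p>
```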


@ -1,13 +1,11 @@
We need to find the beginning of the comment `match:<!--`, then everything till the end of `match:-->`.
The first idea could be `pattern:<!--.*?-->` -- the lazy quantifier makes the dot stop right before `match:-->`.
An acceptable variant is `pattern:<!--.*?-->` -- the lazy quantifier makes the dot stop right before `match:-->`. We also need to add flag `pattern:s` for the dot to include newlines.
But a dot in JavaScript means "any symbol except the newline". So multiline comments won't be found.
We can use `pattern:[\s\S]` instead of the dot to match "anything":
Otherwise multiline comments won't be found:
```js run
let reg = /<!--[\s\S]*?-->/g;
let reg = /<!--.*?-->/gs;
let str = `... <!-- My -- comment
test --> .. <!----> ..


@ -8,7 +8,7 @@ Let's take the following task as an example.
We have a text and need to replace all quotes `"..."` with guillemet marks: `«...»`. They are preferred for typography in many countries.
For instance: `"Hello, world"` should become `«Hello, world»`. Some countries prefer other quotes, like `„Witam, świat!”` (Polish) or `「你好,世界」` (Chinese), but for our task let's choose `«...»`.
For instance: `"Hello, world"` should become `«Hello, world»`. There exist other quotes, such as `„Witam, świat!”` (Polish) or `「你好,世界」` (Chinese), but for our task let's choose `«...»`.
The first thing to do is to locate quoted strings, and then we can replace them.
@ -35,7 +35,7 @@ That can be described as "greediness is the cause of all evil".
To find a match, the regular expression engine uses the following algorithm:
- For every position in the string
- Match the pattern at that position.
- Try to match the pattern at that position.
- If there's no match, go to the next position.
These common words do not make it obvious why the regexp fails, so let's elaborate how the search works for the pattern `pattern:".+"`.
@ -44,7 +44,7 @@ These common words do not make it obvious why the regexp fails, so let's elabora
The regular expression engine tries to find it at the zero position of the source string `subject:a "witch" and her "broom" is one`, but there's `subject:a` there, so there's immediately no match.
Then it advances: goes to the next positions in the source string and tries to find the first character of the pattern there, and finally finds the quote at the 3rd position:
Then it advances: goes to the next positions in the source string and tries to find the first character of the pattern there, fails again, and finally finds the quote at the 3rd position:
![](witch_greedy1.svg)
@ -54,13 +54,13 @@ These common words do not make it obvious why the regexp fails, so let's elabora
![](witch_greedy2.svg)
3. Then the dot repeats because of the quantifier `pattern:.+`. The regular expression engine builds the match by taking characters one by one while it is possible.
3. Then the dot repeats because of the quantifier `pattern:.+`. The regular expression engine adds to the match one character after another.
...When does it become impossible? All characters match the dot, so it only stops when it reaches the end of the string:
...Until when? All characters match the dot, so it only stops when it reaches the end of the string:
![](witch_greedy3.svg)
4. Now the engine finished repeating for `pattern:.+` and tries to find the next character of the pattern. It's the quote `pattern:"`. But there's a problem: the string has finished, there are no more characters!
4. Now the engine finished repeating `pattern:.+` and tries to find the next character of the pattern. It's the quote `pattern:"`. But there's a problem: the string has finished, there are no more characters!
The regular expression engine understands that it took too many `pattern:.+` and starts to *backtrack*.
@ -68,9 +68,9 @@ These common words do not make it obvious why the regexp fails, so let's elabora
![](witch_greedy4.svg)
Now it assumes that `pattern:.+` ends one character before the end and tries to match the rest of the pattern from that position.
Now it assumes that `pattern:.+` ends one character before the string end and tries to match the rest of the pattern from that position.
If there were a quote there, then that would be the end, but the last character is `subject:'e'`, so there's no match.
If there were a quote there, then the search would end, but the last character is `subject:'e'`, so there's no match.
5. ...So the engine decreases the number of repetitions of `pattern:.+` by one more character:
@ -84,19 +84,19 @@ These common words do not make it obvious why the regexp fails, so let's elabora
7. The match is complete.
8. So the first match is `match:"witch" and her "broom"`. The further search starts where the first match ends, but there are no more quotes in the rest of the string `subject:is one`, so no more results.
8. So the first match is `match:"witch" and her "broom"`. If the regular expression has flag `pattern:g`, then the search will continue from where the first match ends. There are no more quotes in the rest of the string `subject:is one`, so no more results.
That's probably not what we expected, but that's how it works.
**In the greedy mode (by default) the quantifier is repeated as many times as possible.**
**In the greedy mode (by default) a quantifier is repeated as many times as possible.**
The regexp engine tries to fetch as many characters as it can by `pattern:.+`, and then shortens that one by one.
The regexp engine adds to the match as many characters as it can for `pattern:.+`, and then shortens that one by one, if the rest of the pattern doesn't match.
For our task we want another thing. That's what the lazy quantifier mode is for.
For our task we want another thing. That's where a lazy mode can help.
## Lazy mode
The lazy mode of quantifier is the opposite of the greedy mode. It means: "repeat a minimal number of times".
The lazy mode of quantifiers is the opposite of the greedy mode. It means: "repeat a minimal number of times".
We can enable it by putting a question mark `pattern:'?'` after the quantifier, so that it becomes `pattern:*?` or `pattern:+?` or even `pattern:??` for `pattern:'?'`.
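A side-by-side sketch of the two modes on the string from the walkthrough above (`console.log` is used instead of `alert`, and the variable names are illustrative):

```js
const phrase = 'a "witch" and her "broom" is one';

// Greedy: .+ runs to the end of the string, then backtracks to the last quote
const greedy = phrase.match(/".+"/g);
console.log(greedy); // "witch" and her "broom"

// Lazy: .+? takes as few characters as possible before the closing quote
const lazy = phrase.match(/".+?"/g);
console.log(lazy); // "witch","broom"
```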
@ -149,20 +149,19 @@ Other quantifiers remain greedy.
For instance:
```js run
alert( "123 456".match(/\d+ \d+?/g) ); // 123 4
alert( "123 456".match(/\d+ \d+?/) ); // 123 4
```
1. The pattern `pattern:\d+` tries to match as many numbers as it can (greedy mode), so it finds `match:123` and stops, because the next character is a space `pattern:' '`.
2. Then there's a space in pattern, it matches.
1. The pattern `pattern:\d+` tries to match as many digits as it can (greedy mode), so it finds `match:123` and stops, because the next character is a space `pattern:' '`.
2. Then there's a space in the pattern, it matches.
3. Then there's `pattern:\d+?`. The quantifier is in lazy mode, so it finds one digit `match:4` and tries to check if the rest of the pattern matches from there.
...But there's nothing in the pattern after `pattern:\d+?`.
The lazy mode doesn't repeat anything without a need. The pattern finished, so we're done. We have a match `match:123 4`.
4. The next search starts from the character `5`.
```smart header="Optimizations"
Modern regular expression engines can optimize internal algorithms to work faster. So they may work a bit different from the described algorithm.
Modern regular expression engines can optimize internal algorithms to work faster. So they may work a bit differently from the described algorithm.
But to understand how regular expressions work and to build regular expressions, we don't need to know about that. They are only used internally to optimize things.
@ -264,7 +263,7 @@ That's what's going on:
2. Then it looks for `pattern:.*?`: takes one character (lazily!), checks if there's a match for `pattern:" class="doc">` (none).
3. Then takes another character into `pattern:.*?`, and so on... until it finally reaches `match:" class="doc">`.
But the problem is: that's already beyond the link, in another tag `<p>`. Not what we want.
But the problem is: that's already beyond the link `<a...>`, in another tag `<p>`. Not what we want.
Here's the picture of the match aligned with the text:
@ -273,11 +272,9 @@ Here's the picture of the match aligned with the text:
<a href="link1" class="wrong">... <p style="" class="doc">
```
So the laziness did not work for us here.
So, we need the pattern to look for `<a href="...something..." class="doc">`, but both greedy and lazy variants have problems.
We need the pattern to look for `<a href="...something..." class="doc">`, but both greedy and lazy variants have problems.
The correct variant would be: `pattern:href="[^"]*"`. It will take all characters inside the `href` attribute till the nearest quote, just what we need.
The correct variant can be: `pattern:href="[^"]*"`. It will take all characters inside the `href` attribute till the nearest quote, just what we need.
A working example:
@ -301,4 +298,4 @@ Greedy
Lazy
: Enabled by the question mark `pattern:?` after the quantifier. The regexp engine tries to match the rest of the pattern before each repetition of the quantifier.
As we've seen, the lazy mode is not a "panacea" for the greedy search. An alternative is a "fine-tuned" greedy search, with exclusions. Soon we'll see more examples of it.
As we've seen, the lazy mode is not a "panacea" for the greedy search. An alternative is a "fine-tuned" greedy search, with exclusions, as in the pattern `pattern:"[^"]+"`.
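A brief sketch of the exclusion approach on both tasks from this chapter. The HTML string is a made-up sample in the spirit of the example above, with one extra link added:

```js
// Quotes: [^"]+ cannot step over a quote, so each match ends at the nearest one
const quoted = 'a "witch" and her "broom" is one'.match(/"[^"]+"/g);
console.log(quoted); // "witch","broom"

// Links: [^"]* keeps the href value inside its own attribute
const markup = '<a href="link1" class="wrong">... <p style="" class="doc"> <a href="link2" class="doc">';
const docLinks = markup.match(/<a href="[^"]*" class="doc">/g);
console.log(docLinks); // <a href="link2" class="doc">
```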


@ -1,12 +1,10 @@
A regexp to search for a 3-digit color `#abc`: `pattern:/#[a-f0-9]{3}/i`.
We can add exactly 3 more optional hex digits. We don't need more or less. Either we have them or we don't.
We can add exactly 3 more optional hex digits. We don't need more or less. The color has either 3 or 6 digits.
The simplest way to add them is to append them to the regexp: `pattern:/#[a-f0-9]{3}([a-f0-9]{3})?/i`
Let's use the quantifier `pattern:{1,2}` for that: we'll have `pattern:/#([a-f0-9]{3}){1,2}/i`.
We can do it in a smarter way though: `pattern:/#([a-f0-9]{3}){1,2}/i`.
Here the regexp `pattern:[a-f0-9]{3}` is in parentheses to apply the quantifier `pattern:{1,2}` to it as a whole.
Here the pattern `pattern:[a-f0-9]{3}` is enclosed in parentheses to apply the quantifier `pattern:{1,2}`.
In action:


@ -11,4 +11,4 @@ let str = "color: #3f3; background-color: #AA00ef; and: #abcd";
alert( str.match(reg) ); // #3f3 #AA00ef
```
P.S. This should be exactly 3 or 6 hex digits: values like `#abcd` should not match.
P.S. This should be exactly 3 or 6 hex digits. Values with 4 digits, such as `#abcd`, should not match.


@ -1,6 +1,6 @@
A positive number with an optional decimal part is (per previous task): `pattern:\d+(\.\d+)?`.
Let's add an optional `-` in the beginning:
Let's add the optional `pattern:-` in the beginning:
```js run
let reg = /-?\d+(\.\d+)?/g;


@ -1,16 +1,19 @@
A regexp for a number is: `pattern:-?\d+(\.\d+)?`. We created it in previous tasks.
An operator is `pattern:[-+*/]`.
An operator is `pattern:[-+*/]`. The hyphen `pattern:-` goes first in the square brackets, because in the middle it would mean a character range, while we just want a character `-`.
Please note:
- Here the dash `pattern:-` goes first in the brackets, because in the middle it would mean a character range, while we just want a character `-`.
- A slash `/` should be escaped inside a JavaScript regexp `pattern:/.../`, we'll do that later.
The slash `/` should be escaped inside a JavaScript regexp `pattern:/.../`, we'll do that later.
We need a number, an operator, and then another number. And optional spaces between them.
The full regular expression: `pattern:-?\d+(\.\d+)?\s*[-+*/]\s*-?\d+(\.\d+)?`.
To get a result as an array let's put parentheses around the data that we need: numbers and the operator: `pattern:(-?\d+(\.\d+)?)\s*([-+*/])\s*(-?\d+(\.\d+)?)`.
It has 3 parts, with `pattern:\s*` between them:
1. `pattern:-?\d+(\.\d+)?` - the first number,
1. `pattern:[-+*/]` - the operator,
1. `pattern:-?\d+(\.\d+)?` - the second number.
To make each of these parts a separate element of the result array, let's enclose them in parentheses: `pattern:(-?\d+(\.\d+)?)\s*([-+*/])\s*(-?\d+(\.\d+)?)`.
In action:
@ -29,11 +32,11 @@ The result includes:
- `result[4] == "12"` (fourth group `(-?\d+(\.\d+)?)` -- the second number)
- `result[5] == undefined` (fifth group `(\.\d+)?` -- the last decimal part is absent, so it's undefined)
We only want the numbers and the operator, without the full match or the decimal parts.
We only want the numbers and the operator, without the full match or the decimal parts, so let's "clean" the result a bit.
The full match (the array's first item) can be removed by shifting the array `pattern:result.shift()`.
The full match (the array's first item) can be removed by shifting the array `result.shift()`.
The decimal groups can be removed by making them into non-capturing groups, by adding `pattern:?:` to the beginning: `pattern:(?:\.\d+)?`.
Groups that contain decimal parts (numbers 2 and 5) `pattern:(\.\d+)` can be excluded by adding `pattern:?:` to the beginning: `pattern:(?:\.\d+)?`.
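To illustrate what `pattern:?:` changes in the result array, here is a rough sketch on the sample input `1.2 + 12`. The slash inside the set is escaped, and the variable names are illustrative:

```js
const expr = "1.2 + 12";

// Capturing decimal groups: 6 items, including ".2" and an undefined slot
const capturing = expr.match(/(-?\d+(\.\d+)?)\s*([-+*\/])\s*(-?\d+(\.\d+)?)/);
console.log(capturing.length); // 6

// Non-capturing decimal groups: only the full match, two numbers and the operator
const nonCapturing = expr.match(/(-?\d+(?:\.\d+)?)\s*([-+*\/])\s*(-?\d+(?:\.\d+)?)/);
console.log(nonCapturing.length); // 4

// Dropping the full match leaves the three parts we want
const parts = nonCapturing.slice(1);
console.log(parts); // 1.2,+,12
```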
The final solution:


@ -4,83 +4,92 @@ A part of a pattern can be enclosed in parentheses `pattern:(...)`. This is call
That has two effects:
1. It allows us to place a part of the match into a separate array.
2. If we put a quantifier after the parentheses, it applies to the parentheses as a whole, not the last character.
1. It allows us to get a part of the match as a separate item in the result array.
2. If we put a quantifier after the parentheses, it applies to the parentheses as a whole.
## Example
## Examples
In the example below the pattern `pattern:(go)+` finds one or more `match:'go'`:
Let's see how parentheses work in examples.
### Example: gogogo
Without parentheses, the pattern `pattern:go+` means `subject:g` character, followed by `subject:o` repeated one or more times. For instance, `match:goooo` or `match:gooooooooo`.
Parentheses group characters together, so `pattern:(go)+` means `match:go`, `match:gogo`, `match:gogogo` and so on.
```js run
alert( 'Gogogo now!'.match(/(go)+/i) ); // "Gogogo"
```
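For contrast, a small sketch of the same search with and without the parentheses (`console.log` replaces `alert`, and the variable names are illustrative):

```js
const shout = "Gogogo now!";

// Without parentheses, + applies only to the letter o
const withoutGroup = shout.match(/go+/i)[0];
console.log(withoutGroup); // Go

// With parentheses, + repeats the whole "go"
const withGroup = shout.match(/(go)+/i)[0];
console.log(withGroup); // Gogogo
```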
Without parentheses, the pattern `pattern:/go+/` means `subject:g`, followed by `subject:o` repeated one or more times. For instance, `match:goooo` or `match:gooooooooo`.
### Example: domain
Parentheses group the word `pattern:(go)` together.
Let's make something more complex -- a regular expression to search for a website domain.
Let's make something more complex -- a regexp to match an email.
Examples of emails:
For example:
```
my@mail.com
john.smith@site.com.uk
mail.com
users.mail.com
smith.users.mail.com
```
The pattern: `pattern:[-.\w]+@([\w-]+\.)+[\w-]{2,20}`.
As we can see, a domain consists of repeated words, a dot after each one except the last one.
1. The first part `pattern:[-.\w]+` (before `@`) may include any alphanumeric word characters, a dot and a dash, to match `match:john.smith`.
2. Then `pattern:@`, and the domain. It may be a subdomain like `host.site.com.uk`, so we match it as "a word followed by a dot" `pattern:([\w-]+\.)` (repeated), and then the last part must be a word: `match:com` or `match:uk` (but not very long: 2-20 characters).
That regexp is not perfect, but good enough to fix errors or occasional mistypes.
For instance, we can find all emails in the string:
In regular expressions that's `pattern:(\w+\.)+\w+`:
```js run
let reg = /[-.\w]+@([\w-]+\.)+[\w-]{2,20}/g;
let regexp = /(\w+\.)+\w+/g;
alert( "site.com my.site.com".match(regexp) ); // site.com,my.site.com
```
The search works, but the pattern can't match a domain with a hyphen, e.g. `my-site.com`, because the hyphen does not belong to class `pattern:\w`.
We can fix it by replacing `pattern:\w` with `pattern:[\w-]` in every word except the last one: `pattern:([\w-]+\.)+\w+`.
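A short sketch of the difference, on a made-up string with one hyphenated domain:

```js
const domains = "site.com my-site.com";

// \w has no hyphen, so only the tail "site.com" of the second domain is found
const plain = domains.match(/(\w+\.)+\w+/g);
console.log(plain); // site.com,site.com

// [\w-] allows hyphens in every word except the last one
const hyphenated = domains.match(/([\w-]+\.)+\w+/g);
console.log(hyphenated); // site.com,my-site.com
```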
### Example: email
The previous example can be extended. We can create a regular expression for emails based on it.
The email format is: `name@domain`. Any word can be the name, hyphens and dots are allowed. In regular expressions that's `pattern:[-.\w]+`.
The pattern:
```js run
let reg = /[-.\w]+@([\w-]+\.)+[\w-]+/g;
alert("my@mail.com @ his@site.com.uk".match(reg)); // my@mail.com, his@site.com.uk
```
In this example parentheses were used to make a group for repetitions `pattern:([\w-]+\.)+`. But there are other uses too, let's see them.
That regexp is not perfect, but it mostly works and helps to fix accidental mistypes. The only truly reliable check for an email is to send a letter to it.
## Contents of parentheses
## Parentheses contents in the match
Parentheses are numbered from left to right. The search engine remembers the content matched by each of them and allows us to reference it in the pattern or in the replacement string.
Parentheses are numbered from left to right. The search engine remembers the content matched by each of them and allows us to get it in the result.
For instance, we'd like to find HTML tags `pattern:<.*?>`, and process them.
The method `str.match(regexp)`, if `regexp` has no flag `g`, looks for the first match and returns it as an array:
1. At index `0`: the full match.
2. At index `1`: the contents of the first parentheses.
3. At index `2`: the contents of the second parentheses.
4. ...and so on...
For instance, we'd like to find HTML tags `pattern:<.*?>`, and process them. It would be convenient to have tag content (what's inside the angles), in a separate variable.
Let's wrap the inner content into parentheses, like this: `pattern:<(.*?)>`.
Then we'll get both the tag as a whole and its content:
```js run
let str = '<h1>Hello, world!</h1>';
let reg = /<(.*?)>/;
alert( str.match(reg) ); // Array: ["<h1>", "h1"]
```
The call to [String#match](mdn:js/String/match) returns groups only if the regexp only looks for the first match, that is: has no `pattern:/.../g` flag.
If we need all matches with their groups then we can use `.matchAll` or `regexp.exec` as described in <info:regexp-methods>:
Now we'll get both the tag as a whole `match:<h1>` and its contents `match:h1` in the resulting array:
```js run
let str = '<h1>Hello, world!</h1>';
// two matches: opening <h1> and closing </h1> tags
let reg = /<(.*?)>/g;
let tag = str.match(/<(.*?)>/);
let matches = Array.from( str.matchAll(reg) );
alert(matches[0]); // Array: ["<h1>", "h1"]
alert(matches[1]); // Array: ["</h1>", "/h1"]
alert( tag[0] ); // <h1>
alert( tag[1] ); // h1
```
Here we have two matches for `pattern:<(.*?)>`, each of them is an array with the full match and groups.
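For comparison, here is a hedged sketch of the `regexp.exec` alternative for the same string. The helper variables `found` and `m` are names chosen for this illustration only:

```js
let str = '<h1>Hello, world!</h1>';
let reg = /<(.*?)>/g;

let found = [];
let m;
// With flag g, each exec call continues from reg.lastIndex
// and returns null when there are no more matches
while ((m = reg.exec(str)) !== null) {
  found.push([m[0], m[1], m.index]);
}

console.log(found[0]); // ["<h1>", "h1", 0]
console.log(found[1]); // ["</h1>", "/h1", 17]
```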
## Nested groups
### Nested groups
Parentheses can be nested. In this case the numbering also goes from left to right.
@ -90,7 +99,13 @@ For instance, when searching a tag in `subject:<span class="my">` we may be inte
2. The tag name: `match:span`.
3. The tag attributes: `match:class="my"`.
Let's add parentheses for them:
Let's add parentheses for them: `pattern:<(([a-z]+)\s*([^>]*))>`.
Here's how they are numbered (left to right, by the opening paren):
![](regexp-nested-groups-pattern.svg)
In action:
```js run
let str = '<span class="my">';
@ -98,20 +113,25 @@ let str = '<span class="my">';
let reg = /<(([a-z]+)\s*([^>]*))>/;
let result = str.match(reg);
alert(result); // <span class="my">, span class="my", span, class="my"
alert(result[0]); // <span class="my">
alert(result[1]); // span class="my"
alert(result[2]); // span
alert(result[3]); // class="my"
```
Here's how groups look:
The zero index of `result` always holds the full match.
![](regexp-nested-groups.svg)
Then groups, numbered from left to right by an opening paren. The first group is returned as `result[1]`. Here it encloses the whole tag content.
At the zero index of the `result` is always the full match.
Then in `result[2]` goes the group from the second opening paren `pattern:([a-z]+)`, the tag name, and then in `result[3]` the attributes: `pattern:([^>]*)`.
Then groups, numbered from left to right. Whichever opens first gives the first group `result[1]`. Here it encloses the whole tag content.
The contents of every group in the string:
Then in `result[2]` goes the group from the second opening `pattern:(` till the corresponding `pattern:)` -- tag name, then we don't group spaces, but group attributes for `result[3]`.
![](regexp-nested-groups-matches.svg)
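Since the order of groups is fixed by their opening parens, one possible way to read them is array destructuring. This is only a sketch: the sample tag and the variable names `full`, `content`, `name` and `attrs` are assumptions made for illustration.

```js
let str = '<a href="/home" target="_blank">';

// Order follows the opening parens: whole content, tag name, attributes
let [full, content, name, attrs] = str.match(/<(([a-z]+)\s*([^>]*))>/);

console.log(full);    // <a href="/home" target="_blank">
console.log(content); // a href="/home" target="_blank"
console.log(name);    // a
console.log(attrs);   // href="/home" target="_blank"
```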
**Even if a group is optional and doesn't exist in the match, the corresponding `result` array item is present (and equals `undefined`).**
### Optional groups
Even if a group is optional and doesn't exist in the match (e.g. has the quantifier `pattern:(...)?`), the corresponding `result` array item is present and equals `undefined`.
For instance, let's consider the regexp `pattern:a(z)?(c)?`. It looks for `"a"` optionally followed by `"z"` optionally followed by `"c"`.
@ -128,10 +148,10 @@ alert( match[2] ); // undefined
The array has the length of `3`, but all groups are empty.
And here's a more complex match for the string `subject:ack`:
And here's a more complex match for the string `subject:ac`:
```js run
let match = 'ack'.match(/a(z)?(c)?/)
let match = 'ac'.match(/a(z)?(c)?/)
alert( match.length ); // 3
alert( match[0] ); // ac (whole match)
@ -141,11 +161,90 @@ alert( match[2] ); // c
The array length is permanent: `3`. But there's nothing for the group `pattern:(z)?`, so the result is `["ac", undefined, "c"]`.
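Because a missing optional group is `undefined`, destructuring defaults are one possible way to supply fallback values. In this sketch the placeholder strings `"(no z)"` and `"(no c)"` are arbitrary:

```js
// A default value applies only when the array item is undefined
let [whole, z = "(no z)", c = "(no c)"] = "ac".match(/a(z)?(c)?/);

console.log(whole); // ac
console.log(z);     // (no z)
console.log(c);     // c
```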
## Searching for all matches with groups: matchAll
```warn header="`matchAll` is a new method, polyfill may be needed"
The method `matchAll` is not supported in old browsers.
A polyfill may be required, such as <https://github.com/ljharb/String.prototype.matchAll>.
```
When we search for all matches (flag `pattern:g`), the `match` method does not return contents for groups.
For example, let's find all tags in a string:
```js run
let str = '<h1> <h2>';
let tags = str.match(/<(.*?)>/g);
alert( tags ); // <h1>,<h2>
```
The result is an array of matches, but without details about each of them. But in practice we usually need contents of capturing groups in the result.
To get them, we should search using the method `str.matchAll(regexp)`.
It was added to the JavaScript language long after `match`, as its "new and improved version".
Just like `match`, it looks for matches, but there are 3 differences:
1. It returns not an array, but an iterable object.
2. When the flag `pattern:g` is present, it returns every match as an array with groups.
3. If there are no matches, it returns not `null`, but an empty iterable object.
For instance:
```js run
let results = '<h1> <h2>'.matchAll(/<(.*?)>/gi);
// results - is not an array, but an iterable object
alert(results); // [object RegExp String Iterator]
alert(results[0]); // undefined
results = Array.from(results); // let's turn it into array
alert(results[0]); // <h1>,h1 (1st tag)
alert(results[1]); // <h2>,h2 (2nd tag)
```
As we can see, the first difference is very important. We can't get a match as `results[0]`, because that object isn't a pseudoarray. We can turn it into a real `Array` using `Array.from`. There are more details about pseudoarrays and iterables in the article <info:iterable>.
There's no need for `Array.from` if we're looping over the results:
```js run
let results = '<h1> <h2>'.matchAll(/<(.*?)>/gi);
for(let result of results) {
alert(result);
  // first alert: <h1>,h1
  // second: <h2>,h2
}
```
...Or using destructuring:
```js
let [tag1, tag2] = '<h1> <h2>'.matchAll(/<(.*?)>/gi);
```
```smart header="Why is a result of `matchAll` an iterable object, not an array?"
Why is the method designed like that? The reason is simple: optimization.
The call to `matchAll` does not perform the search. Instead, it returns an iterable object that holds no results at first. The search is performed each time we iterate over it, e.g. in a loop.
So the engine finds only as many results as needed, no more.
E.g. there are potentially 100 matches in the text, but in a `for..of` loop we find 5 of them, decide that's enough and `break`. Then the engine won't spend time finding the other 95 matches.
```
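The on-demand behaviour described above might be used as in the following sketch. The sample text and the names `firstTwo`, `it` and `first` are assumptions for illustration:

```js
let text = "<a> <b> <c> <d> <e>";

// Stop after two results: the remaining tags are not searched for
let firstTwo = [];
for (let m of text.matchAll(/<(.*?)>/g)) {
  firstTwo.push(m[1]);
  if (firstTwo.length === 2) break;
}
console.log(firstTwo); // ["a", "b"]

// The iterator can also be advanced manually, one match per next() call
let it = text.matchAll(/<(.*?)>/g);
let first = it.next().value;
console.log(first[0], first[1]); // <a> a
```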
## Named groups
Remembering groups by their numbers is hard. For simple patterns it's doable, but for more complex ones we can give names to parentheses.
Remembering groups by their numbers is hard. For simple patterns it's doable, but for more complex ones counting parentheses is inconvenient. We have a much better option: give names to parentheses.
That's done by putting `pattern:?<name>` immediately after the opening paren, like this:
That's done by putting `pattern:?<name>` immediately after the opening paren.
For example, let's look for a date in the format "year-month-day":
```js run
*!*
@ -162,71 +261,75 @@ alert(groups.day); // 30
As you can see, the groups reside in the `.groups` property of the match.
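Here is another small sketch of the `.groups` property, using a hypothetical "hours:minutes" pattern instead of a date. Named groups remain available by number as well:

```js
let timeRegexp = /(?<hours>[0-9]{2}):(?<minutes>[0-9]{2})/;
let m = "Breakfast at 09:30".match(timeRegexp);

console.log(m.groups.hours);   // 09
console.log(m.groups.minutes); // 30
console.log(m[1], m[2]);       // 09 30 (the same groups by number)
```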
We can also use them in the replacement string, as `pattern:$<name>` (like `$1..9`, but a name instead of a digit).
To look for all dates, we can add flag `pattern:g`.
For instance, let's reformat the date into `day.month.year`:
We'll also need `matchAll` to obtain full matches, together with groups:
```js run
let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;
let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/g;
let str = "2019-04-30";
let str = "2019-10-30 2020-01-01";
let rearranged = str.replace(dateRegexp, '$<day>.$<month>.$<year>');
let results = str.matchAll(dateRegexp);
alert(rearranged); // 30.04.2019
for(let result of results) {
let {year, month, day} = result.groups;
alert(`${day}.${month}.${year}`);
// first alert: 30.10.2019
// second: 01.01.2020
}
```
If we use a function for the replacement, then named `groups` object is always the last argument:
## Capturing groups in replacement
The method `str.replace(regexp, replacement)`, which replaces all matches of `regexp` in `str`, lets us use parentheses contents in the `replacement` string. That's done using `pattern:$n`, where `pattern:n` is the group number.
For example,
```js run
let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;
let str = "John Bull";
let regexp = /(\w+) (\w+)/;
let str = "2019-04-30";
let rearranged = str.replace(dateRegexp,
(str, year, month, day, offset, input, groups) =>
`${groups.day}.${groups.month}.${groups.year}`
);
alert(rearranged); // 30.04.2019
alert( str.replace(regexp, '$2, $1') ); // Bull, John
```
Usually, when we intend to use named groups, we don't need positional arguments of the function. For the majority of real-life cases we only need `str` and `groups`.
For named parentheses the reference will be `pattern:$<name>`.
So we can write it a little bit shorter:
For example, let's reformat dates from "year-month-day" to "day.month.year":
```js
let rearranged = str.replace(dateRegexp, (str, ...args) => {
let {year, month, day} = args.pop();
alert(str); // 2019-04-30
alert(year); // 2019
alert(month); // 04
alert(day); // 30
});
```js run
let regexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/g;
let str = "2019-10-30, 2020-01-01";
alert( str.replace(regexp, '$<day>.$<month>.$<year>') );
// 30.10.2019, 01.01.2020
```
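When the replacement needs some logic, a function can be passed instead of a string. The following is a sketch under the assumption that the regexp has named groups, in which case the `groups` object is the last argument of the function:

```js
let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/g;
let str = "2019-10-30, 2020-01-01";

let rearranged = str.replace(dateRegexp, (...args) => {
  // the named groups object comes last
  let { year, month, day } = args.pop();
  return `${day}.${month}.${year}`;
});

console.log(rearranged); // 30.10.2019, 01.01.2020
```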
## Non-capturing groups with ?:
Sometimes we need parentheses to correctly apply a quantifier, but we don't want their contents in results.
A group may be excluded by adding `pattern:?:` in the beginning.
For instance, if we want to find `pattern:(go)+`, but don't want to remember the contents (`go`) in a separate array item, we can write: `pattern:(?:go)+`.
For instance, if we want to find `pattern:(go)+`, but don't want the parentheses contents (`go`) as a separate array item, we can write: `pattern:(?:go)+`.
In the example below we only get the name "John" as a separate member of the `results` array:
In the example below we only get the name `match:John` as a separate member of the match:
```js run
let str = "Gogo John!";
let str = "Gogogo John!";
*!*
// exclude Gogo from capturing
// ?: excludes 'go' from capturing
let reg = /(?:go)+ (\w+)/i;
*/!*
let result = str.match(reg);
alert( result.length ); // 2
alert( result[0] ); // Gogogo John (full match)
alert( result[1] ); // John
alert( result.length ); // 2 (no more items in the array)
```
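To see the difference side by side, this sketch runs the same search with a capturing and a non-capturing group. The variable names `capturing` and `nonCapturing` are chosen for illustration:

```js
let str = "Gogogo John!";

let capturing = str.match(/(go)+ (\w+)/i);
let nonCapturing = str.match(/(?:go)+ (\w+)/i);

console.log(capturing.length);    // 3
console.log(capturing[1]);        // go (a repeated group keeps its last repetition)
console.log(capturing[2]);        // John

console.log(nonCapturing.length); // 2
console.log(nonCapturing[1]);     // John
```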
## Summary
@ -235,8 +338,13 @@ Parentheses group together a part of the regular expression, so that the quantif
Parentheses groups are numbered left-to-right, and can optionally be named with `(?<name>...)`.
The content matched by a group can be referenced in the replacement string as `$1`, `$2` etc, or by the name `$<name>` if named.
The content, matched by a group, can be obtained in the results:
So, parentheses groups are called "capturing groups", as they "capture" a part of the match. We get that part separately from the result as a member of the array or in `.groups` if it's named.
- The method `str.match` returns capturing groups only without flag `pattern:g`.
- The method `str.matchAll` always returns capturing groups.
We can exclude the group from remembering (make it "non-capturing") by putting `?:` at the start: `(?:...)`. That's used if we'd like to apply a quantifier to the whole group, but don't need it in the result.
If the parentheses have no name, then their contents are available in the match array by their number. Named parentheses are also available in the property `groups`.
We can also use parentheses contents in the replacement string in `str.replace`: by the number `$n` or the name `$<name>`.
A group may be excluded from remembering by adding `pattern:?:` at its start. That's used when we need to apply a quantifier to the whole group, but don't want it as a separate item in the results array. We also can't reference such parentheses in the replacement string.

View file

@ -0,0 +1 @@
<svg xmlns="http://www.w3.org/2000/svg" width="320" height="130" viewBox="0 0 320 130"><defs><style>@import url(https://fonts.googleapis.com/css?family=Open+Sans:bold,italic,bolditalic%7CPT+Mono);@font-face{font-family:&apos;PT Mono&apos;;font-weight:700;font-style:normal;src:local(&apos;PT MonoBold&apos;),url(/font/PTMonoBold.woff2) format(&apos;woff2&apos;),url(/font/PTMonoBold.woff) format(&apos;woff&apos;),url(/font/PTMonoBold.ttf) format(&apos;truetype&apos;)}</style></defs><g id="regexp" fill="none" fill-rule="evenodd" stroke="none" stroke-width="1"><g id="regexp-nested-groups.svg"><text id="&lt;(([a-z]+)\s*([^&gt;]*)" font-family="PTMono-Regular, PT Mono" font-size="22" font-weight="normal"><tspan x="20" y="75" fill="#8A704D">&lt;</tspan> <tspan x="33.2" y="75" fill="#DB2023">((</tspan> <tspan x="59.6" y="75" fill="#8A704D">[a-z]+</tspan> <tspan x="138.8" y="75" fill="#DB2023">)</tspan> <tspan x="152" y="75" fill="#8A704D">\s*</tspan> <tspan x="191.6" y="75" fill="#DB2023">(</tspan> <tspan x="204.8" y="75" fill="#8A704D">[^&gt;]*</tspan> <tspan x="270.8" y="75" fill="#D0021B">))</tspan> <tspan x="297.2" y="75" fill="#8A704D">&gt;</tspan></text><path id="Line" stroke="#D0021B" stroke-linecap="square" d="M42.5 45.646V29.354"/><path id="Line-2" stroke="#D0021B" stroke-linecap="square" d="M290.5 45.646V29.354"/><path id="Line" stroke="#D0021B" stroke-linecap="square" d="M42.5 28.5h248"/><path id="Line-5" stroke="#D0021B" stroke-linecap="square" d="M52.5 101.646V85.354"/><path id="Line-4" stroke="#D0021B" stroke-linecap="square" d="M145.5 101.646V85.354"/><path id="Line-3" stroke="#D0021B" stroke-linecap="square" d="M52.5 102.5h93"/><text id="1" fill="#D0021B" font-family="PTMono-Regular, PT Mono" font-size="20" font-weight="normal"><tspan x="24" y="44">1</tspan></text><text id="span-class=&quot;my&quot;" fill="#417505" font-family="PTMono-Regular, PT Mono" font-size="20" font-weight="normal"><tspan x="82" y="23">span class=&quot;my&quot;</tspan></text><text 
id="2" fill="#D0021B" font-family="PTMono-Regular, PT Mono" font-size="20" font-weight="normal"><tspan x="35" y="101">2</tspan></text><text id="span" fill="#417505" font-family="PTMono-Regular, PT Mono" font-size="20" font-weight="normal"><tspan x="73" y="119">span</tspan></text><path id="Line-8" stroke="#D0021B" stroke-linecap="square" d="M197.5 101.646V85.354"/><path id="Line-7" stroke="#D0021B" stroke-linecap="square" d="M277.5 101.646V85.354"/><path id="Line-6" stroke="#D0021B" stroke-linecap="square" d="M197.5 102.5h80"/><text id="3" fill="#D0021B" font-family="PTMono-Regular, PT Mono" font-size="20" font-weight="normal"><tspan x="182" y="101">3</tspan></text><text id="class=&quot;my&quot;" fill="#417505" font-family="PTMono-Regular, PT Mono" font-size="20" font-weight="normal"><tspan x="185" y="121">class=&quot;my&quot;</tspan></text></g></g></svg>
