5-regular-expressions/01-regexp-introduction/article.md
|
@ -0,0 +1,131 @@
|
|||
# Patterns and flags
|
||||
|
||||
Regular expressions are a powerful way of searching and replacing inside a string.
|
||||
|
||||
In JavaScript, regular expressions are implemented using objects of the built-in `RegExp` class and are integrated with strings.
|
||||
|
||||
Please note that regular expressions vary between programming languages. In this tutorial we concentrate on JavaScript. Of course there's a lot in common, but they are somewhat different in Perl, Ruby, PHP, etc.
|
||||
|
||||
[cut]
|
||||
|
||||
## Regular expressions
|
||||
|
||||
A regular expression (also "regexp", or just "reg") consists of a *pattern* and optional *flags*.
|
||||
|
||||
There are two syntaxes to create a regular expression object.
|
||||
|
||||
The long syntax:
|
||||
|
||||
```js
|
||||
regexp = new RegExp("pattern", "flags");
|
||||
```
|
||||
|
||||
...And the short one, using slashes `"/"`:
|
||||
|
||||
```js
|
||||
regexp = /pattern/; // no flags
|
||||
regexp = /pattern/gmi; // with flags g,m and i (to be covered soon)
|
||||
```
|
||||
|
||||
Slashes `"/"` tell JavaScript that we are creating a regular expression. They play the same role as quotes for strings.
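
As a quick check, the sketch below builds the same regexp with both syntaxes and compares the `source` and `flags` properties of the resulting objects. The pattern `love` and the variable names `longForm` and `shortForm` are chosen only for illustration.

```js
// Two ways to create the same regular expression
const longForm = new RegExp("love", "gi"); // pattern and flags given as strings
const shortForm = /love/gi;                // pattern between slashes, flags after

console.log(longForm.source, shortForm.source); // love love
console.log(longForm.flags, shortForm.flags);   // gi gi
console.log(longForm instanceof RegExp, shortForm instanceof RegExp); // true true
```

Both variables hold ordinary `RegExp` objects, so everything said about one syntax applies to the other.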
|
||||
|
||||
## Usage
|
||||
|
||||
To search inside a string, we can use the method [search](mdn:js/String/search).
|
||||
|
||||
Here's an example:
|
||||
|
||||
```js run
|
||||
let str = "I love JavaScript!"; // will search here
|
||||
|
||||
let regexp = /love/;
|
||||
alert( str.search(regexp) ); // 2
|
||||
```
|
||||
|
||||
The `str.search` method looks for the pattern `pattern:/love/` and returns the position inside the string. As we might guess, `pattern:/love/` is the simplest possible pattern. What it does is a simple substring search.
|
||||
|
||||
The code above is the same as:
|
||||
|
||||
```js run
|
||||
let str = "I love JavaScript!"; // will search here
|
||||
|
||||
let substr = 'love';
|
||||
alert( str.search(substr) ); // 2
|
||||
```
|
||||
|
||||
So searching for `pattern:/love/` is the same as searching for `"love"`.
|
||||
|
||||
But that's only for now. Soon we'll create more complex regular expressions with much more searching power.
|
||||
|
||||
```smart header="Colors"
|
||||
From here on the color scheme is:
|
||||
|
||||
- regexp -- `pattern:red`
|
||||
- string (where we search) -- `subject:blue`
|
||||
- result -- `match:green`
|
||||
```
|
||||
|
||||
|
||||
````smart header="When to use `new RegExp`?"
|
||||
Normally we use the short syntax `/.../`. But it does not allow inserting variables, so we must know the exact regexp at the time of writing the code.
|
||||
|
||||
On the other hand, `new RegExp` allows us to construct a pattern dynamically from a string.
|
||||
|
||||
So we can figure out what we need to search and create `new RegExp` from it:
|
||||
|
||||
```js run
|
||||
let search = prompt("What do you want to search for?", "love");
|
||||
let regexp = new RegExp(search);
|
||||
|
||||
// find whatever the user wants
|
||||
alert( "I love JavaScript".search(regexp));
|
||||
```
|
||||
````
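
The same idea can be wrapped in a small function. This is only a sketch: `findIgnoringCase` is a hypothetical helper name, and it assumes that `word` consists of plain letters or digits, because any character with a special meaning in a pattern would be interpreted as pattern syntax. It also passes the `i` flag, which is described in the Flags section of this chapter.

```js
// Hypothetical helper: find where `word` occurs in `text`, ignoring case.
// Assumes `word` has no characters with a special meaning in a pattern.
function findIgnoringCase(text, word) {
  const regexp = new RegExp(word, "i"); // the pattern comes from a variable
  return text.search(regexp);           // index of the match, or -1
}

console.log(findIgnoringCase("I love JavaScript", "LOVE")); // 2
console.log(findIgnoringCase("I love JavaScript", "hate")); // -1
```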
|
||||
|
||||
|
||||
## Flags
|
||||
|
||||
Regular expressions may have flags that affect the search.
|
||||
|
||||
This tutorial covers 5 of them (newer versions of JavaScript have added a few more):
|
||||
|
||||
`i`
|
||||
: With this flag the search is case-insensitive: no difference between `A` and `a` (see the example below).
|
||||
|
||||
`g`
|
||||
: With this flag the search looks for all matches, without it -- only the first one (we'll see uses in the next chapter).
|
||||
|
||||
`m`
|
||||
: Multiline mode (covered in the chapter <info:regexp-multiline>).
|
||||
|
||||
`u`
|
||||
: Enables full Unicode support. The flag enables correct processing of surrogate pairs. More about that in the chapter <info:regexp-unicode>.
|
||||
|
||||
`y`
|
||||
: Sticky mode (covered in the [next chapter](info:regexp-methods#y-flag)).
|
||||
|
||||
|
||||
## The "i" flag
|
||||
|
||||
The simplest flag is `i`.
|
||||
|
||||
An example with it:
|
||||
|
||||
```js run
|
||||
let str = "I love JavaScript!";
|
||||
|
||||
alert( str.search(/LOVE/) ); // -1 (not found)
|
||||
alert( str.search(/LOVE/i) ); // 2
|
||||
```
|
||||
|
||||
1. The first search returns `-1` (not found), because the search is case-sensitive by default.
|
||||
2. With the flag `pattern:/LOVE/i` the search found `match:love` at position 2.
|
||||
|
||||
So the `i` flag already makes regular expressions more powerful than a simple substring search. But there's so much more. We'll cover other flags and features in the next chapters.
|
||||
|
||||
|
||||
## Summary
|
||||
|
||||
- A regular expression consists of a pattern and optional flags: `g`, `i`, `m`, `u`, `y`.
|
||||
- Without flags and special symbols that we'll study later, the search by a regexp is the same as a substring search.
|
||||
- The method `str.search(regexp)` returns the index where the match is found or `-1` if there's no match.
|
5-regular-expressions/02-regexp-methods/article.md
|
@ -0,0 +1,409 @@
|
|||
# Methods of RegExp and String
|
||||
|
||||
There are two sets of methods to deal with regular expressions.
|
||||
|
||||
1. First, regular expressions are objects of the built-in [RegExp](mdn:js/RegExp) class, which provides many methods.
|
||||
2. Besides that, there are methods of regular strings that can work with regexps.
|
||||
|
||||
The structure is a bit messy, so we'll first consider the methods separately, and then look at practical recipes for common tasks.
|
||||
|
||||
[cut]
|
||||
|
||||
## str.search(reg)
|
||||
|
||||
We've seen this method already. It returns the position of the first match or `-1` if none found:
|
||||
|
||||
```js run
|
||||
let str = "A drop of ink may make a million think";
|
||||
|
||||
alert( str.search( *!*/a/i*/!* ) ); // 0 (the first position)
|
||||
```
|
||||
|
||||
**The important limitation: `search` always looks for the first match.**
|
||||
|
||||
We can't find the next positions using `search`; there's just no syntax for that. But there are other methods that can.
|
||||
|
||||
## str.match(reg), no "g" flag
|
||||
|
||||
The behavior of the `str.match` method varies depending on the `g` flag. First let's see the case without it.
|
||||
|
||||
Without the `g` flag, `str.match(reg)` looks for the first match only.
|
||||
|
||||
The result is an array with that match and additional properties:
|
||||
|
||||
- `index` -- the position of the match inside the string,
|
||||
- `input` -- the subject string.
|
||||
|
||||
For instance:
|
||||
|
||||
```js run
|
||||
let str = "Fame is the thirst of youth";
|
||||
|
||||
let result = str.match( *!*/fame/i*/!* );
|
||||
|
||||
alert( result[0] ); // Fame (the match)
|
||||
alert( result.index ); // 0 (at the zero position)
|
||||
alert( result.input ); // "Fame is the thirst of youth" (the string)
|
||||
```
|
||||
|
||||
The array may have more than one element.
|
||||
|
||||
**If a part of the pattern is delimited by parentheses `(...)`, then it becomes a separate element of the array.**
|
||||
|
||||
For instance:
|
||||
|
||||
```js run
|
||||
let str = "JavaScript is a programming language";
|
||||
|
||||
let result = str.match( *!*/JAVA(SCRIPT)/i*/!* );
|
||||
|
||||
alert( result[0] ); // JavaScript (the whole match)
|
||||
alert( result[1] ); // Script (the part of the match that corresponds to the parentheses)
|
||||
alert( result.index ); // 0
|
||||
alert( result.input ); // JavaScript is a programming language
|
||||
```
|
||||
|
||||
Due to the `i` flag the search is case-insensitive, so it finds `match:JavaScript`. The part of the match that corresponds to `pattern:SCRIPT` becomes a separate array item.
|
||||
|
||||
We'll be back to parentheses later in the chapter <info:regexp-groups>. They are great for search-and-replace.
|
||||
|
||||
## str.match(reg) with "g" flag
|
||||
|
||||
When there's a `"g"` flag, then `str.match` returns an array of all matches. There are no additional properties in that array, and parentheses do not create any elements.
|
||||
|
||||
For instance:
|
||||
|
||||
```js run
|
||||
let str = "HO-Ho-ho!";
|
||||
|
||||
let result = str.match( *!*/ho/ig*/!* );
|
||||
|
||||
alert( result ); // HO, Ho, ho (all matches, case-insensitive)
|
||||
```
|
||||
|
||||
With parentheses nothing changes, here we go:
|
||||
|
||||
```js run
|
||||
let str = "HO-Ho-ho!";
|
||||
|
||||
let result = str.match( *!*/h(o)/ig*/!* );
|
||||
|
||||
alert( result ); // HO, Ho, ho
|
||||
```
|
||||
|
||||
So, with `g` flag the `result` is a simple array of matches. No additional properties.
|
||||
|
||||
If we want to get information about match positions and use parentheses, then we should use the [RegExp#exec](mdn:js/RegExp/exec) method that we'll cover below.
|
||||
|
||||
````warn header="If there are no matches, the call to `match` returns `null`"
|
||||
Please note, that's important. If there were no matches, the result is not an empty array, but `null`.
|
||||
|
||||
Keep that in mind to avoid pitfalls like this:
|
||||
|
||||
```js run
|
||||
let str = "Hey-hey-hey!";
|
||||
|
||||
alert( str.match(/ho/gi).length ); // error! there's no length of null
|
||||
```
|
||||
````
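
One way to stay safe is to check the result before using it. The sketch below wraps that check in a hypothetical helper named `countMatches`; the name and the choice to return `0` for "no matches" are assumptions made for this example.

```js
// Hypothetical helper: count matches without tripping over null
function countMatches(str, regexp) {
  const matches = str.match(regexp); // an array of matches (with g) or null
  return matches ? matches.length : 0;
}

console.log(countMatches("HO-Ho-ho!", /ho/gi));    // 3
console.log(countMatches("Hey-hey-hey!", /ho/gi)); // 0
```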
|
||||
|
||||
## str.split(regexp|substr, limit)
|
||||
|
||||
Splits the string using the regexp (or a substring) as a delimiter.
|
||||
|
||||
We already used `split` with strings, like this:
|
||||
|
||||
```js run
|
||||
alert('12-34-56'.split('-')) // [12, 34, 56]
|
||||
```
|
||||
|
||||
But we can also pass a regular expression:
|
||||
|
||||
```js run
|
||||
alert('12-34-56'.split(/-/)) // [12, 34, 56]
|
||||
```
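
The heading of this section also lists a `limit` argument. As a small sketch (the variable name `parts` is just for illustration), the second argument caps the number of pieces in the returned array:

```js
// Split on dashes, but keep only the first two pieces
const parts = "12-34-56".split(/-/, 2);

console.log(parts); // [ '12', '34' ]
```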
|
||||
|
||||
## str.replace(str|reg, str|func)
|
||||
|
||||
The Swiss army knife for search and replace in strings.
|
||||
|
||||
The simplest use -- search and replace a substring, like this:
|
||||
|
||||
```js run
|
||||
// replace a dash by a colon
|
||||
alert('12-34-56'.replace("-", ":")) // 12:34-56
|
||||
```
|
||||
|
||||
When the first argument of `replace` is a string, it only looks for the first match.
|
||||
|
||||
To find all dashes, we need to use not the string `"-"`, but a regexp `pattern:/-/g`, with an obligatory `g` flag:
|
||||
|
||||
```js run
|
||||
// replace all dashes by a colon
|
||||
alert( '12-34-56'.replace( *!*/-/g*/!*, ":" ) ) // 12:34:56
|
||||
```
|
||||
|
||||
The second argument is a replacement string.
|
||||
|
||||
We can use special characters in it:
|
||||
|
||||
| Symbol | Inserts |
|--------|--------|
|`$$`|`"$"` |
|`$&`|the whole match|
|<code>$`</code>|a part of the string before the match|
|`$'`|a part of the string after the match|
|`$n`|if `n` is a 1-2 digit number, then it means the contents of n-th parentheses counting from left to right|
|
||||
|
||||
For instance, let's use `$&` to replace all occurrences of `"John"` with `"Mr.John"`:
|
||||
|
||||
```js run
|
||||
let str = "John Doe, John Smith and John Bull.";
|
||||
|
||||
// for each John - replace it with Mr. and then John
|
||||
alert(str.replace(/John/g, 'Mr.$&'));
|
||||
// "Mr.John Doe, Mr.John Smith and Mr.John Bull.";
|
||||
```
|
||||
|
||||
Parentheses are very often used together with `$1`, `$2`, like this:
|
||||
|
||||
```js run
|
||||
let str = "John Smith";
|
||||
|
||||
alert(str.replace(/(John) (Smith)/, '$2, $1')) // Smith, John
|
||||
```
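
The remaining symbols from the table work the same way. Here is a short sketch with made-up strings that shows the "before" and "after" parts of the match, and `$$` for a literal dollar sign:

```js
// "-" is found at position 1 of "a-b": before it is "a", after it is "b"
const around = "a-b".replace("-", "[$`|$&|$']");
console.log(around); // a[a|-|b]b

// "$$" inserts a single dollar sign, "$&" inserts the match itself
const price = "Price: 10".replace(/10/, "$$$&");
console.log(price); // Price: $10
```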
|
||||
|
||||
**For situations that require "smart" replacements, the second argument can be a function.**
|
||||
|
||||
It will be called for each match, and its result will be inserted as a replacement.
|
||||
|
||||
For instance:
|
||||
|
||||
```js run
|
||||
let i = 0;
|
||||
|
||||
// replace each "ho" by the result of the function
|
||||
alert("HO-Ho-ho".replace(/ho/gi, function() {
|
||||
return ++i;
|
||||
})); // 1-2-3
|
||||
```
|
||||
|
||||
In the example above the function just returns the next number every time, but usually the result is based on the match.
|
||||
|
||||
The function is called with arguments `func(str, p1, p2, ..., pn, offset, s)`:
|
||||
|
||||
1. `str` -- the match,
|
||||
2. `p1, p2, ..., pn` -- contents of parentheses (if there are any),
|
||||
3. `offset` -- position of the match,
|
||||
4. `s` -- the source string.
|
||||
|
||||
If there are no parentheses in the regexp, then the function always has 3 arguments: `func(str, offset, s)`.
|
||||
|
||||
Let's use it to show full information about matches:
|
||||
|
||||
```js run
|
||||
// show and replace all matches
|
||||
function replacer(str, offset, s) {
|
||||
alert(`Found ${str} at position ${offset} in string ${s}`);
|
||||
return str.toLowerCase();
|
||||
}
|
||||
|
||||
let result = "HO-Ho-ho".replace(/ho/gi, replacer);
|
||||
alert( 'Result: ' + result ); // Result: ho-ho-ho
|
||||
|
||||
// shows each match:
|
||||
// Found HO at position 0 in string HO-Ho-ho
|
||||
// Found Ho at position 3 in string HO-Ho-ho
|
||||
// Found ho at position 6 in string HO-Ho-ho
|
||||
```
|
||||
|
||||
In the example below there are two parentheses, so `replacer` is called with 5 arguments: `str` is the full match, then parentheses, and then `offset` and `s`:
|
||||
|
||||
```js run
|
||||
function replacer(str, name, surname, offset, s) {
|
||||
// name is the first parentheses, surname is the second one
|
||||
return surname + ", " + name;
|
||||
}
|
||||
|
||||
let str = "John Smith";
|
||||
|
||||
alert(str.replace(/(John) (Smith)/, replacer)) // Smith, John
|
||||
```
|
||||
|
||||
Using a function gives us the ultimate replacement power, because it gets all the information about the match, has access to outer variables and can do everything.
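
As one more sketch of a replacement that is based on the match, the hypothetical function `tagWithOffset` below appends the position to every match. The regexp has no parentheses, so the function receives the match and the offset as its first two arguments.

```js
// Hypothetical replacer: tag every match with the position where it was found
function tagWithOffset(match, offset) {
  return `${match}@${offset}`;
}

const tagged = "HO-Ho-ho".replace(/ho/gi, tagWithOffset);
console.log(tagged); // HO@0-Ho@3-ho@6
```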
|
||||
|
||||
## regexp.test(str)
|
||||
|
||||
Let's move on to the methods of the `RegExp` class, which are callable on regexps themselves.
|
||||
|
||||
The `test` method looks for any match and returns `true` or `false` depending on whether it found one.
|
||||
|
||||
So it's basically the same as `str.search(reg) != -1`, for instance:
|
||||
|
||||
```js run
|
||||
let str = "I love JavaScript";
|
||||
|
||||
// these two tests do the same
|
||||
alert( *!*/love/i*/!*.test(str) ); // true
|
||||
alert( str.search(*!*/love/i*/!*) != -1 ); // true
|
||||
```
|
||||
|
||||
An example with the negative answer:
|
||||
|
||||
```js run
|
||||
let str = "Bla-bla-bla";
|
||||
|
||||
alert( *!*/love/i*/!*.test(str) ); // false
|
||||
alert( str.search(*!*/love/i*/!*) != -1 ); // false
|
||||
```
|
||||
|
||||
## regexp.exec(str)
|
||||
|
||||
We've already seen these searching methods:
|
||||
|
||||
- `search` -- looks for the position of the match,
|
||||
- `match` -- if there's no `g` flag, returns the first match with parentheses,
|
||||
- `match` -- if there's a `g` flag -- returns all matches, without separating parentheses.
|
||||
|
||||
The `regexp.exec` method is a bit harder to use, but it allows us to find all matches with parentheses and positions.
|
||||
|
||||
It behaves differently depending on whether the regexp has the `g` flag.
|
||||
|
||||
- If there's no `g`, then `regexp.exec(str)` returns the first match, exactly as `str.match(reg)`.
|
||||
- If there's `g`, then `regexp.exec(str)` returns the first match and *remembers* the position after it in the `regexp.lastIndex` property. The next call starts to search from `regexp.lastIndex` and returns the next match. If there are no more matches then `regexp.exec` returns `null` and `regexp.lastIndex` is set to `0`.
|
||||
|
||||
As we can see, the method gives us nothing new if we use it without the `g` flag, because `str.match` does exactly the same.
|
||||
|
||||
But the `g` flag allows us to get all matches with their positions and parentheses groups.
|
||||
|
||||
Here's an example of how subsequent `regexp.exec` calls return matches one by one:
|
||||
|
||||
```js run
|
||||
let str = "A lot about JavaScript at https://javascript.info";
|
||||
|
||||
let regexp = /JAVA(SCRIPT)/ig;
|
||||
|
||||
*!*
|
||||
// Look for the first match
|
||||
*/!*
|
||||
let matchOne = regexp.exec(str);
|
||||
alert( matchOne[0] ); // JavaScript
|
||||
alert( matchOne[1] ); // Script
|
||||
alert( matchOne.index ); // 12 (the position of the match)
|
||||
alert( matchOne.input ); // the same as str
|
||||
|
||||
alert( regexp.lastIndex ); // 22 (the position after the match)
|
||||
|
||||
*!*
|
||||
// Look for the second match
|
||||
*/!*
|
||||
let matchTwo = regexp.exec(str); // continue searching from regexp.lastIndex
|
||||
alert( matchTwo[0] ); // javascript
|
||||
alert( matchTwo[1] ); // script
|
||||
alert( matchTwo.index ); // 34 (the position of the match)
|
||||
alert( matchTwo.input ); // the same as str
|
||||
|
||||
alert( regexp.lastIndex ); // 44 (the position after the match)
|
||||
|
||||
*!*
|
||||
// Look for the third match
|
||||
*/!*
|
||||
let matchThree = regexp.exec(str); // continue searching from regexp.lastIndex
|
||||
alert( matchThree ); // null (no match)
|
||||
|
||||
alert( regexp.lastIndex ); // 0 (reset)
|
||||
```
|
||||
|
||||
As we can see, each `regexp.exec` call returns the match in a "full format": as an array with parentheses, `index` and `input` properties.
|
||||
|
||||
The main use case for `regexp.exec` is to find all matches in a loop:
|
||||
|
||||
```js run
|
||||
let str = 'A lot about JavaScript at https://javascript.info';
|
||||
|
||||
let regexp = /javascript/ig;
|
||||
|
||||
let result;
|
||||
|
||||
while (result = regexp.exec(str)) {
|
||||
alert( `Found ${result[0]} at ${result.index}` );
|
||||
}
|
||||
```
|
||||
|
||||
The loop continues until `regexp.exec` returns `null`, which means "no more matches".
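
The loop can be packaged into a reusable function. The sketch below is one possible shape, not a standard API: `findAll` and the fields `match`, `groups` and `index` are names invented for this example. It refuses regexps without the `g` flag, because then `exec` would return the first match again and again, and it steps forward manually after an empty match for the same reason.

```js
// Hypothetical helper: gather every match with its position and parentheses
function findAll(str, regexp) {
  if (!regexp.global) {
    // without g, exec would return the first match forever
    throw new Error("findAll expects a regexp with the g flag");
  }
  regexp.lastIndex = 0; // start from the beginning of the string
  const found = [];
  let result;
  while ((result = regexp.exec(str)) !== null) {
    found.push({ match: result[0], groups: result.slice(1), index: result.index });
    if (result[0] === "") regexp.lastIndex++; // an empty match does not advance lastIndex
  }
  return found;
}

const all = findAll("A lot about JavaScript at https://javascript.info", /JAVA(SCRIPT)/ig);
for (const item of all) {
  console.log(`${item.match} (${item.groups}) at ${item.index}`);
}
// JavaScript (Script) at 12
// javascript (script) at 34
```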
|
||||
|
||||
````smart header="Search from the given position"
|
||||
We can force `regexp.exec` to start searching from the given position by setting `lastIndex` manually:
|
||||
|
||||
```js run
|
||||
let str = 'A lot about JavaScript at https://javascript.info';
|
||||
|
||||
let regexp = /javascript/ig;
|
||||
regexp.lastIndex = 30;
|
||||
|
||||
alert( regexp.exec(str).index ); // 34, the search starts from the 30th position
|
||||
```
|
||||
````
|
||||
|
||||
## The "y" flag [#y-flag]
|
||||
|
||||
The `y` flag means that the search should find a match exactly at the position specified by the property `regexp.lastIndex` and only there.
|
||||
|
||||
In other words, normally the search is made in the whole string: `pattern:/javascript/` looks for "javascript" everywhere in the string.
|
||||
|
||||
But when a regexp has the `y` flag, then it only looks for the match at the position specified in `regexp.lastIndex` (`0` by default).
|
||||
|
||||
For instance:
|
||||
|
||||
```js run
|
||||
let str = "I love JavaScript!";
|
||||
|
||||
let reg = /javascript/iy;
|
||||
|
||||
alert( reg.lastIndex ); // 0 (default)
|
||||
alert( str.match(reg) ); // null, not found at position 0
|
||||
|
||||
reg.lastIndex = 7;
|
||||
alert( str.match(reg) ); // JavaScript (right, that word starts at position 7)
|
||||
|
||||
// for any other reg.lastIndex the result is null
|
||||
```
|
||||
|
||||
The regexp `pattern:/javascript/iy` can only be found if we set `reg.lastIndex=7`, because due to the `y` flag the engine only tries to find it in a single place within the string -- at the `reg.lastIndex` position.
|
||||
|
||||
So, what's the point? Where do we apply that?
|
||||
|
||||
The reason is performance.
|
||||
|
||||
The `y` flag works great for parsers -- programs that need to "read" the text and build in-memory syntax structure or perform actions from it. For that we move along the text and apply regular expressions to see what we have next: a string? A number? Something else?
|
||||
|
||||
The `y` flag allows us to apply a regular expression (or many of them one-by-one) exactly at the given position and when we understand what's there, we can move on -- step by step examining the text.
|
||||
|
||||
Without the flag, the regexp engine always searches till the end of the text, which takes time, especially if the text is large. So our parser would be very slow. The `y` flag is exactly the right thing here.
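
To make the parser idea more concrete, here is a minimal sketch of a tokenizer. Everything in it is an assumption made for illustration: the `rules` list, the token types and the `tokenize` function are hypothetical names, and the patterns use `\d`, `\w`, `\s` and the `+` sign ("one or more"), which this chapter has not covered. Each rule has the `y` flag, so it is tried only at the current position.

```js
// Hypothetical token rules; the y flag makes each one match only at lastIndex
const rules = [
  { type: "number", regexp: /\d+/y },
  { type: "word", regexp: /\w+/y },
  { type: "space", regexp: /\s+/y },
];

function tokenize(text) {
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    let token = null;
    for (const rule of rules) {
      rule.regexp.lastIndex = pos; // look exactly at the current position
      const result = rule.regexp.exec(text);
      if (result) {
        token = { type: rule.type, value: result[0] };
        break;
      }
    }
    // no rule matched here: take a single character as an "other" token
    if (!token) token = { type: "other", value: text[pos] };
    tokens.push(token);
    pos += token.value.length; // move on to the text after the token
  }
  return tokens;
}

const tokens = tokenize("let x = 42");
console.log(tokens.map(token => token.type).join(", "));
// word, space, word, space, other, space, number
```

The order of the rules matters in this sketch: `\w` also matches digits, so the number rule is tried first.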
|
||||
|
||||
## Summary, recipes
|
||||
|
||||
Methods become much easier to understand if we separate them by their use in real-life tasks.
|
||||
|
||||
To search for the first match only:
|
||||
: - Find the position of the first match -- `str.search(reg)`.
|
||||
- Find the full match -- `str.match(reg)`.
|
||||
- Check if there's a match -- `regexp.test(str)`.
|
||||
- Find the match from the given position -- `regexp.exec(str)`, set `regexp.lastIndex` to position.
|
||||
|
||||
To search for all matches:
|
||||
: - An array of matches -- `str.match(reg)`, the regexp with `g` flag.
|
||||
- Get all matches with full information about each one -- `regexp.exec(str)` with `g` flag in the loop.
|
||||
|
||||
To search and replace:
|
||||
: - Replace with another string or a function result -- `str.replace(reg, str|func)`
|
||||
|
||||
To split the string:
|
||||
: - `str.split(str|reg)`
|
||||
|
||||
We also covered two flags:
|
||||
|
||||
- The `g` flag to find all matches (global search),
|
||||
- The `y` flag to search at exactly the given position inside the text.
|
||||
|
||||
Now we know the methods and can use regular expressions. But we need to learn their syntax, so let's move on.
|
|
@ -0,0 +1,6 @@
|
|||
|
||||
The answer: `pattern:\b\d\d:\d\d\b`.
|
||||
|
||||
```js run
|
||||
alert( "Breakfast at 09:00 in the room 123:456.".match( /\b\d\d:\d\d\b/ ) ); // 09:00
|
||||
```
|
|
@ -0,0 +1,8 @@
|
|||
# Find the time
|
||||
|
||||
The time has a format: `hours:minutes`. Both hours and minutes have two digits, like `09:00`.
|
||||
|
||||
Make a regexp to find time in the string: `subject:Breakfast at 09:00 in the room 123:456.`
|
||||
|
||||
P.S. In this task there's no need to check time correctness yet, so `25:99` can also be a valid result.
|
||||
P.P.S. The regexp shouldn't match `123:456`.
|
5-regular-expressions/03-regexp-character-classes/article.md
|
@ -0,0 +1,228 @@
|
|||
# Character classes
|
||||
|
||||
Consider a practical task -- we have a phone number `"+7(903)-123-45-67"`, and we need to find all digits in that string. Other characters do not interest us.
|
||||
|
||||
A character class is a special notation that matches any symbol from the set.
|
||||
|
||||
[cut]
|
||||
|
||||
For instance, there's a "digit" class. It's written as `\d`. We put it in the pattern, and during the search any digit matches it.
|
||||
|
||||
For instance, the regexp `pattern:/\d/` looks for a single digit:
|
||||
|
||||
```js run
|
||||
let str = "+7(903)-123-45-67";
|
||||
|
||||
let reg = /\d/;
|
||||
|
||||
alert( str.match(reg) ); // 7
|
||||
```
|
||||
|
||||
The regexp is not global in the example above, so it only looks for the first match.
|
||||
|
||||
Let's add the `g` flag to look for all digits:
|
||||
|
||||
```js run
|
||||
let str = "+7(903)-123-45-67";
|
||||
|
||||
let reg = /\d/g;
|
||||
|
||||
alert( str.match(reg) ); // array of matches: 7,9,0,3,1,2,3,4,5,6,7
|
||||
```
|
||||
|
||||
## Most used classes: \d \s \w
|
||||
|
||||
That was a character class for digits. There are other character classes as well.
|
||||
|
||||
Most used are:
|
||||
|
||||
`\d` ("d" is from "digit")
|
||||
: A digit: a character from `0` to `9`.
|
||||
|
||||
`\s` ("s" is from "space")
|
||||
: A space symbol: that includes spaces, tabs, newlines.
|
||||
|
||||
`\w` ("w" is from "word")
|
||||
: A "wordly" character: either a letter of the English alphabet or a digit or an underscore. Non-English letters (like Cyrillic or Hindi) do not belong to `\w`.
|
||||
|
||||
For instance, `pattern:\d\s\w` means a digit followed by a space character followed by a wordly character, like `"1 Z"`.
|
||||
|
||||
A regexp may contain both regular symbols and character classes.
|
||||
|
||||
For instance, `pattern:CSS\d` matches a string `match:CSS` with a digit after it:
|
||||
|
||||
```js run
|
||||
let str = "CSS4 is cool";
|
||||
let reg = /CSS\d/
|
||||
|
||||
alert( str.match(reg) ); // CSS4
|
||||
```
|
||||
|
||||
Also we can use many character classes:
|
||||
|
||||
```js run
|
||||
alert( "I love HTML5!".match(/\s\w\w\w\w\d/) ); // ' HTML5' (the match starts with the space)
|
||||
```
|
||||
|
||||
The match (each character class corresponds to one result character):
|
||||
|
||||

|
||||
|
||||
## Word boundary: \b
|
||||
|
||||
The word boundary `pattern:\b` is a special character class.
|
||||
|
||||
It does not denote a character, but rather a boundary between characters.
|
||||
|
||||
For instance, `pattern:\bJava\b` matches `match:Java` in the string `subject:Hello, Java!`, but not in the string `subject:Hello, JavaScript!`.
|
||||
|
||||
```js run
|
||||
alert( "Hello, Java!".match(/\bJava\b/) ); // Java
|
||||
alert( "Hello, JavaScript!".match(/\bJava\b/) ); // null
|
||||
```
|
||||
|
||||
The boundary has "zero width" in the sense that usually a character class means a character in the result (like a wordly character or a digit), but not in this case.
|
||||
|
||||
The boundary is a test.
|
||||
|
||||
When the regular expression engine is doing the search, it moves along the string in an attempt to find the match. At each string position it tries to find the pattern.
|
||||
|
||||
When the pattern contains `pattern:\b`, it tests that the position in string fits one of the conditions:
|
||||
|
||||
- String start, and the first string character is `\w`.
|
||||
- String end, and the last string character is `\w`.
|
||||
- Inside the string: from one side is `\w`, from the other side -- not `\w`.
|
||||
|
||||
For instance, in the string `subject:Hello, Java!` the following positions match `\b`:
|
||||
|
||||

|
||||
|
||||
So it matches `pattern:\bHello\b` and `pattern:\bJava\b`, but not `pattern:\bHell\b` (because there's no word boundary after `l`) and not `pattern:\bJava!\b` (because the exclamation sign is not a wordly character, so there's no word boundary after it).
|
||||
|
||||
```js run
|
||||
alert( "Hello, Java!".match(/\bHello\b/) ); // Hello
|
||||
alert( "Hello, Java!".match(/\bJava\b/) ); // Java
alert( "Hello, Java!".match(/\bHell\b/) ); // null
alert( "Hello, Java!".match(/\bJava!\b/) ); // null
|
||||
```
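
The three conditions for `\b` listed above can also be written out by hand. The function `isWordBoundary` below is a hypothetical re-implementation made only to illustrate the rule; the engine does this test internally.

```js
// Hypothetical re-implementation of the \b test for a position in a string
function isWordBoundary(str, pos) {
  const wordBefore = pos > 0 && /\w/.test(str[pos - 1]);
  const wordAfter = pos < str.length && /\w/.test(str[pos]);
  return wordBefore !== wordAfter; // exactly one side is a \w character
}

const subject = "Hello, Java!";
const boundaries = [];
for (let pos = 0; pos <= subject.length; pos++) {
  if (isWordBoundary(subject, pos)) boundaries.push(pos);
}

console.log(boundaries.join(",")); // 0,5,7,11
```

The positions found are before `H`, after `o`, before `J` and after the last `a`. There is no boundary after the exclamation sign.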
|
||||
|
||||
Once again let's note that `pattern:\b` makes the search engine test for the boundary, so that `pattern:Java\b` finds `match:Java` only when followed by a word boundary, but it does not add a letter to the result.
|
||||
|
||||
Usually we use `\b` to find standalone English words. So if we want the `"Java"` language, then `pattern:\bJava\b` finds exactly the standalone word and ignores it when it's a part of `"JavaScript"`.
|
||||
|
||||
Another example: a regexp `pattern:\b\d\d\b` looks for standalone two-digit numbers. In other words, it requires that before and after `pattern:\d\d` there is a symbol different from `\w` (or the beginning/end of the string).
|
||||
|
||||
```js run
|
||||
alert( "1 23 456 78".match(/\b\d\d\b/g) ); // 23,78
|
||||
```
|
||||
|
||||
```warn header="Word boundary doesn't work for non-English alphabets"
|
||||
The word boundary check `\b` tests for a boundary between `\w` and something else. But `\w` means an English letter (or a digit or an underscore), so the test won't work for other characters (like Cyrillic or hieroglyphs).
|
||||
```
|
||||
|
||||
|
||||
## Reverse classes
|
||||
|
||||
For every character class there exists a "reverse class", denoted with the same letter, but uppercased.
|
||||
|
||||
The "reverse" means that it matches all other characters, for instance:
|
||||
|
||||
`\D`
|
||||
: Non-digit: any character except `\d`, for instance a letter.
|
||||
|
||||
`\S`
|
||||
: Non-space: any character except `\s`, for instance a letter.
|
||||
|
||||
`\W`
|
||||
: Non-wordly character: anything but `\w`.
|
||||
|
||||
`\B`
|
||||
: Non-boundary: a test reverse to `\b`.
|
||||
|
||||
In the beginning of the chapter we saw how to get all digits from the phone `subject:+7(903)-123-45-67`. Let's get a "pure" phone number from the string:
|
||||
|
||||
```js run
|
||||
let str = "+7(903)-123-45-67";
|
||||
|
||||
alert( str.match(/\d/g).join('') ); // 79031234567
|
||||
```
|
||||
|
||||
An alternative way would be to find non-digits and remove them from the string:
|
||||
|
||||
|
||||
```js run
|
||||
let str = "+7(903)-123-45-67";
|
||||
|
||||
alert( str.replace(/\D/g, "") ); // 79031234567
|
||||
```
|
||||
|
||||
## Spaces are regular characters
|
||||
|
||||
Please note that regular expressions may include spaces. They are treated like regular characters.
|
||||
|
||||
Usually we pay little attention to spaces. For us strings `subject:1-5` and `subject:1 - 5` are nearly identical.
|
||||
|
||||
But if a regexp does not take spaces into account, it won't work.
|
||||
|
||||
Let's try to find digits separated by a dash:
|
||||
|
||||
```js run
|
||||
alert( "1 - 5".match(/\d-\d/) ); // null, no match!
|
||||
```
|
||||
|
||||
Here we fix it by adding spaces into the regexp:
|
||||
|
||||
```js run
|
||||
alert( "1 - 5".match(/\d - \d/) ); // 1 - 5, now it works
|
||||
```
|
||||
|
||||
Of course, spaces are needed only if we look for them. Extra spaces (just like any other extra characters) may prevent a match:
|
||||
|
||||
```js run
|
||||
alert( "1-5".match(/\d - \d/) ); // null, because the string 1-5 has no spaces
|
||||
```
|
||||
|
||||
In other words, in a regular expression all characters matter. Spaces too.
|
||||
|
||||
## A dot is any character
|
||||
|
||||
The dot `"."` is a special character class that matches *any character except a newline*.
|
||||
|
||||
For instance:
|
||||
|
||||
```js run
|
||||
alert( "Z".match(/./) ); // Z
|
||||
```
|
||||
|
||||
Or in the middle of a regexp:
|
||||
|
||||
```js run
|
||||
let reg = /CS.4/;
|
||||
|
||||
alert( "CSS4".match(reg) ); // CSS4
|
||||
alert( "CS-4".match(reg) ); // CS-4
|
||||
alert( "CS 4".match(reg) ); // CS 4 (space is also a character)
|
||||
```
|
||||
|
||||
Please note that the dot means "any character", but not the "absence of a character". There must be a character to match it:
|
||||
|
||||
```js run
|
||||
alert( "CS4".match(/CS.4/) ); // null, no match because there's no character for the dot
|
||||
```
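
The "except a newline" part of the rule can be checked in the same way. In this small sketch (the strings and variable names are made up for the example), the dot accepts a space but not a newline:

```js
const withNewline = "A\nB".match(/A.B/); // the character between A and B is a newline
const withSpace = "A B".match(/A.B/);    // the character between A and B is a space

console.log(withNewline);  // null, the dot does not match a newline
console.log(withSpace[0]); // A B
```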
|
||||
|
||||
|
||||
## Summary
|
||||
|
||||
We covered character classes:
|
||||
|
||||
- `\d` -- digits.
|
||||
- `\D` -- non-digits.
|
||||
- `\s` -- space symbols, tabs, newlines.
|
||||
- `\S` -- all but `\s`.
|
||||
- `\w` -- English letters, digits, underscore `'_'`.
|
||||
- `\W` -- all but `\w`.
|
||||
- `'.'` -- any character except a newline.
|
||||
|
||||
If we want to search for a character that has a special meaning like a backslash or a dot, then we should escape it with a backslash: `pattern:\.`
|
||||
|
||||
Please note that a regexp may also contain string special characters such as a newline `\n`. There's no conflict with character classes, because other letters are used for them.
|
5-regular-expressions/04-regexp-escaping/article.md
|
@ -0,0 +1,91 @@
|
|||
|
||||
# Escaping, special characters
|
||||
|
||||
As we've seen, a backslash `"\"` is used to denote character classes. So it's a special character.
|
||||
|
||||
There are other special characters as well, that have special meaning in a regexp. They are used to do more powerful searches.
|
||||
|
||||
Here's a full list of them: `pattern:[ \ ^ $ . | ? * + ( )`.
|
||||
|
||||
Don't try to remember it -- when we deal with each of them separately, you'll know it by heart automatically.
|
||||
|
||||
## Escaping
|
||||
|
||||
To use a special character as a regular one, prepend it with a backslash.
|
||||
|
||||
That's also called "escaping a character".
|
||||
|
||||
For instance, we need to find a dot `pattern:'.'`. In a regular expression a dot means "any character except a newline", so if we really mean "a dot", let's put a backslash before it: `pattern:\.`.
|
||||
|
||||
```js run
|
||||
alert( "Chapter 5.1".match(/\d\.\d/) ); // 5.1
|
||||
```
|
||||
|
||||
Parentheses are also special characters, so if we want them, we should use `pattern:\(`. The example below looks for a string `"g()"`:
|
||||
|
||||
```js run
|
||||
alert( "function g()".match(/g\(\)/) ); // "g()"
|
||||
```
|
||||
|
||||
If we're looking for a backslash `\`, then we should double it:
|
||||
|
||||
```js run
|
||||
alert( "1\\2".match(/\\/) ); // '\'
|
||||
```
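
When the text to search for comes from a variable, escaping by hand is not possible. The sketch below shows one common way to automate it. Please note that `escapeRegExp` is a hypothetical helper written for this example, not a built-in function. It also escapes `]`, `{` and `}`, which are not in the list above, just to be on the safe side. The result is passed to `new RegExp`, the long syntax for creating a regexp.

```js
// Hypothetical helper (not a built-in): put a backslash before every
// character that may have a special meaning in a regexp.
function escapeRegExp(text) {
  // In the replacement, $& stands for the matched character
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const escaped = escapeRegExp("g()");
console.log(escaped); // g\(\)

const reg = new RegExp(escaped);
const found = "function g()".match(reg);
console.log(found[0]); // g()
```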
|
||||
|
||||
## A slash
|
||||
|
||||
The slash symbol `'/'` is not a special character, but in JavaScript it is used to open and close the regexp: `pattern:/...pattern.../`, so we should escape it too.
|
||||
|
||||
Here's what a search for a slash `'/'` looks like:
|
||||
|
||||
```js run
|
||||
alert( "/".match(/\//) ); // '/'
|
||||
```
|
||||
|
||||
On the other hand, the alternative `new RegExp` syntax does not require escaping it:
|
||||
|
||||
```js run
|
||||
alert( "/".match(new RegExp("/")) ); // '/'
|
||||
```
|
||||
|
||||
## new RegExp
|
||||
|
||||
If we are creating a regular expression with `new RegExp`, then we need to do some more escaping.
|
||||
|
||||
For instance, consider this:
|
||||
|
||||
```js run
|
||||
let reg = new RegExp("\d\.\d");
|
||||
|
||||
alert( "Chapter 5.1".match(reg) ); // null
|
||||
```
|
||||
|
||||
It doesn't work, but why?
|
||||
|
||||
The reason is string escaping rules. Look here:
|
||||
|
||||
```js run
|
||||
alert("\d\.\d"); // d.d
|
||||
```
|
||||
|
||||
Inside a string, backslashes are used for escaping and for string-specific special characters like `\n`. The quotes "consume" and interpret them, for instance:
|
||||
|
||||
- `\n` -- becomes a newline character,
|
||||
- `\u1234` -- becomes the Unicode character with such code,
|
||||
- ...And when there's no special meaning: like `\d` or `\z`, then the backslash is simply removed.
|
||||
|
||||
So the call to `new RegExp` gets a string without backslashes.
|
||||
|
||||
To fix it, we need to double backslashes, because quotes turn `\\` into `\`:
|
||||
|
||||
```js run
|
||||
*!*
|
||||
let regStr = "\\d\\.\\d";
|
||||
*/!*
|
||||
alert(regStr); // \d\.\d (correct now)
|
||||
|
||||
let reg = new RegExp(regStr);
|
||||
|
||||
alert( "Chapter 5.1".match(reg) ); // 5.1
|
||||
```
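
Doubling every backslash is one option. Another one, sketched below, is the built-in `String.raw` tag, which leaves backslashes in a template literal as they are written. This is only an alternative for illustration, and the variable names are made up for the example.

```js
// String.raw keeps the backslashes exactly as typed
const rawPattern = String.raw`\d\.\d`;
console.log(rawPattern); // \d\.\d

// The same string as the one with doubled backslashes
console.log(rawPattern === "\\d\\.\\d"); // true

const rawReg = new RegExp(rawPattern);
const rawMatch = "Chapter 5.1".match(rawReg);
console.log(rawMatch[0]); // 5.1
```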
|
|
@ -0,0 +1,12 @@
|
|||
Answers: **no, yes**.
|
||||
|
||||
- In the string `subject:Java` it doesn't match anything, because `pattern:[^script]` means "any character except given ones". So the regexp looks for `"Java"` followed by one such symbol, but there's a string end, no symbols after it.
|
||||
|
||||
```js run
|
||||
alert( "Java".match(/Java[^script]/) ); // null
|
||||
```
|
||||
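- Yes, because the `pattern:[^script]` part matches the character `"S"`. It's not one of `pattern:script`: the regexp is case-sensitive (there's no `i` flag), so it treats `"S"` as a different character from `"s"`.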
|
||||
|
||||
```js run
|
||||
alert( "JavaScript".match(/Java[^script]/) ); // "JavaS"
|
||||
```
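
The role of letter case can be checked by running the same pattern with and without the `i` flag. This is a small sketch with made-up variable names; `console.log` is used so that it runs outside a browser too.

```js
// Case-sensitive: "S" is not among s, c, r, i, p, t
const sensitive = "JavaScript".match(/Java[^script]/);
console.log(sensitive[0]); // JavaS

// With the i flag, "S" counts as one of the excluded letters
const insensitive = "JavaScript".match(/Java[^script]/i);
console.log(insensitive); // null
```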
|
|
@ -0,0 +1,5 @@
|
|||
# Java[^script]
|
||||
|
||||
We have a regexp `pattern:/Java[^script]/`.
|
||||
|
||||
Does it match anything in the string `subject:Java`? In the string `subject:JavaScript`?
|
|
@ -0,0 +1,8 @@
|
|||
Answer: `pattern:\d\d[-:]\d\d`.
|
||||
|
||||
```js run
|
||||
let reg = /\d\d[-:]\d\d/g;
|
||||
alert( "Breakfast at 09:00. Dinner at 21-30".match(reg) ); // 09:00, 21-30
|
||||
```
|
||||
|
||||
Please note that the dash `pattern:'-'` has a special meaning in square brackets, but only between other characters, not when it's in the beginning or at the end, so we don't need to escape it.
|
|
@ -0,0 +1,12 @@
|
|||
# Find the time as hh:mm or hh-mm
|
||||
|
||||
The time can be in the format `hours:minutes` or `hours-minutes`. Both hours and minutes have 2 digits: `09:00` or `21-30`.
|
||||
|
||||
Write a regexp to find time:
|
||||
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
alert( "Breakfast at 09:00. Dinner at 21-30".match(reg) ); // 09:00, 21-30
|
||||
```
|
||||
|
||||
P.S. In this task we assume that the time is always correct, there's no need to filter out bad strings like "45:67". Later we'll deal with that too.
|
|
@ -0,0 +1,116 @@
|
|||
# Sets and ranges [...]
|
||||
|
||||
Several characters or character classes inside square brackets `[…]` mean "search for any character among the given ones".
|
||||
|
||||
[cut]
|
||||
|
||||
## Sets
|
||||
|
||||
For instance, `pattern:[eao]` means any of the 3 characters: `'a'`, `'e'`, or `'o'`.
|
||||
|
||||
That's called a *set*. Sets can be used in a regexp along with regular characters:
|
||||
|
||||
```js run
|
||||
// find [t or m], and then "op"
|
||||
alert( "Mop top".match(/[tm]op/gi) ); // "Mop", "top"
|
||||
```
|
||||
|
||||
Please note that although there are multiple characters in the set, they correspond to exactly one character in the match.
|
||||
|
||||
So the example below gives no matches:
|
||||
|
||||
```js run
|
||||
// find "V", then [o or i], then "la"
|
||||
alert( "Voila".match(/V[oi]la/) ); // null, no matches
|
||||
```
|
||||
|
||||
The pattern assumes:
|
||||
|
||||
- `pattern:V`,
|
||||
- then *one* of the letters `pattern:[oi]`,
|
||||
- then `pattern:la`.
|
||||
|
||||
So there would be a match for `match:Vola` or `match:Vila`.
|
||||
|
||||
## Ranges
|
||||
|
||||
Square brackets may also contain *character ranges*.
|
||||
|
||||
For instance, `pattern:[a-z]` is a character in range from `a` to `z`, and `pattern:[0-5]` is a digit from `0` to `5`.
|
||||
|
||||
In the example below we're searching for `"x"` followed by two digits or letters from `A` to `F`:
|
||||
|
||||
```js run
|
||||
alert( "Exception 0xAF".match(/x[0-9A-F][0-9A-F]/g) ); // xAF
|
||||
```
|
||||
|
||||
Please note that in the word `subject:Exception` there's a substring `subject:xce`. It didn't match the pattern, because the letters are lowercase, while in the set `pattern:[0-9A-F]` they are uppercase.
|
||||
|
||||
If we want to find it too, then we can add a range `a-f`: `pattern:[0-9A-Fa-f]`. The `i` flag would allow lowercase too.
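
Both options can be compared side by side. The sketch below reuses the string from the example above; the variable names are only for illustration.

```js
const hexText = "Exception 0xAF";

// Uppercase range only: "xce" is skipped
const upperOnly = hexText.match(/x[0-9A-F][0-9A-F]/g);
console.log(upperOnly.join(", ")); // xAF

// Added range a-f: "xce" is found too
const bothRanges = hexText.match(/x[0-9A-Fa-f][0-9A-Fa-f]/g);
console.log(bothRanges.join(", ")); // xce, xAF

// The i flag gives the same result with the shorter set
const withFlag = hexText.match(/x[0-9A-F][0-9A-F]/gi);
console.log(withFlag.join(", ")); // xce, xAF
```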
|
||||
|
||||
**Character classes are shorthands for certain character sets.**
|
||||
|
||||
For instance:
|
||||
|
||||
- **\d** -- is the same as `pattern:[0-9]`,
|
||||
- **\w** -- is the same as `pattern:[a-zA-Z0-9_]`,
|
||||
- **\s** -- is the same as `pattern:[\t\n\v\f\r ]` plus a few other Unicode space characters.
|
||||
|
||||
We can use character classes inside `[…]` as well.
|
||||
|
||||
For instance, we want to match all word characters or a dash, for words like "twenty-third". We can't do it with `pattern:\w+`, because the `pattern:\w` class does not include a dash. But we can use `pattern:[\w-]`.
|
||||
|
||||
We also can use a combination of classes to cover every possible character, like `pattern:[\s\S]`. That matches spaces or non-spaces -- any character. That's wider than a dot `"."`, because the dot matches any character except a newline.
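
Here is a short sketch of both ideas. The sample strings and variable names are assumptions made for this example.

```js
const phrase = "twenty-third place";

// \w alone stops at the dash
const plainWords = phrase.match(/\w+/g);
console.log(plainWords.join(", ")); // twenty, third, place

// [\w-] accepts the dash as well
const dashedWords = phrase.match(/[\w-]+/g);
console.log(dashedWords.join(", ")); // twenty-third, place

const twoLines = "a\nb";

// The dot refuses the newline between "a" and "b"
const dotResult = twoLines.match(/a.b/);
console.log(dotResult); // null

// [\s\S] accepts any character, including the newline
const anyResult = twoLines.match(/a[\s\S]b/);
console.log(anyResult[0] === twoLines); // true
```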
|
||||
|
||||
## Excluding ranges
|
||||
|
||||
Besides normal ranges, there are "excluding" ranges that look like `pattern:[^…]`.
|
||||
|
||||
They are denoted by a caret character `^` at the start and match any character *except the given ones*.
|
||||
|
||||
For instance:
|
||||
|
||||
- `pattern:[^aeyo]` -- any character except `'a'`, `'e'`, `'y'` or `'o'`.
|
||||
- `pattern:[^0-9]` -- any character except a digit, the same as `\D`.
|
||||
- `pattern:[^\s]` -- any non-space character, same as `\S`.
|
||||
|
||||
The example below looks for any characters except letters, digits and spaces:
|
||||
|
||||
```js run
|
||||
alert( "alice15@gmail.com".match(/[^\d\sA-Z]/gi) ); // @ and .
|
||||
```
|
||||
|
||||
## No escaping in […]
|
||||
|
||||
Usually when we want to find exactly the dot character, we need to escape it like `pattern:\.`. And if we need a backslash, then we use `pattern:\\`.
|
||||
|
||||
In square brackets the vast majority of special characters can be used without escaping:
|
||||
|
||||
- A dot `pattern:'.'`.
|
||||
- A plus `pattern:'+'`.
|
||||
- Parentheses `pattern:'( )'`.
|
||||
- Dash `pattern:'-'` in the beginning or the end (where it does not define a range).
|
||||
- A caret `pattern:'^'` if not in the beginning (where it means exclusion).
|
||||
- And the opening square bracket `pattern:'['`.
|
||||
|
||||
In other words, all special characters are allowed except where they mean something for square brackets.
|
||||
|
||||
A dot `"."` inside square brackets means just a dot. The pattern `pattern:[.,]` would look for one of characters: either a dot or a comma.
|
||||
|
||||
In the example below the regexp `pattern:[-().^+]` looks for one of the characters `-().^+`:
|
||||
|
||||
```js run
|
||||
// No need to escape
|
||||
let reg = /[-().^+]/g;
|
||||
|
||||
alert( "1 + 2 - 3".match(reg) ); // Matches +, -
|
||||
```
|
||||
|
||||
...But if you decide to escape them "just in case", then there would be no harm:
|
||||
|
||||
```js run
|
||||
// Escaped everything
|
||||
let reg = /[\-\(\)\.\^\+]/g;
|
||||
|
||||
alert( "1 + 2 - 3".match(reg) ); // also works: +, -
|
||||
```
|
69
5-regular-expressions/06-regexp-unicode/article.md
Normal file
|
@ -0,0 +1,69 @@
|
|||
|
||||
# The unicode flag
|
||||
|
||||
The unicode flag `/.../u` enables the correct support of surrogate pairs.
|
||||
|
||||
Surrogate pairs are explained in the chapter <info:string>.
|
||||
|
||||
Let's briefly recall them here. In short, normally characters are encoded with 2 bytes. That gives us 65536 characters maximum. But there are more characters in the world.
|
||||
|
||||
So certain rare characters are encoded with 4 bytes, like `𝒳` (mathematical X) or `😄` (a smile).
|
||||
|
||||
Here are the unicode values to compare:
|
||||
|
||||
| Character | Unicode | Bytes |
|
||||
|------------|---------|--------|
|
||||
| `a` | 0x0061 | 2 |
|
||||
| `≈` | 0x2248 | 2 |
|
||||
|`𝒳`| 0x1d4b3 | 4 |
|
||||
|`𝒴`| 0x1d4b4 | 4 |
|
||||
|`😄`| 0x1f604 | 4 |
|
||||
|
||||
So characters like `a` and `≈` occupy 2 bytes, and those rare ones take 4.
|
||||
|
||||
The unicode is made in such a way that the 4-byte characters only have a meaning as a whole.
|
||||
|
||||
In the past JavaScript did not know about that, and many string methods still have problems. For instance, `length` thinks that here are two characters:
|
||||
|
||||
```js run
|
||||
alert('😄'.length); // 2
|
||||
alert('𝒳'.length); // 2
|
||||
```
|
||||
|
||||
...But we can see that there's only one, right? The point is that `length` treats 4 bytes as two 2-byte characters. That's incorrect, because they must be considered only together (so-called "surrogate pair").
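
The two halves can be inspected directly. In the sketch below the smile is written as the code point escape `\u{1F604}` so that the example does not depend on the file encoding; the variable name is made up for the illustration.

```js
const smile = "\u{1F604}"; // the smile character from the table above

console.log(smile.length);      // 2, counted in 2-byte units
console.log([...smile].length); // 1, string iteration goes by whole characters

// The code of the whole character
console.log(smile.codePointAt(0).toString(16)); // 1f604

// The two halves of the surrogate pair
console.log(smile.charCodeAt(0).toString(16)); // d83d
console.log(smile.charCodeAt(1).toString(16)); // de04
```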
|
||||
|
||||
Normally, regular expressions also treat "long characters" as two 2-byte ones.
|
||||
|
||||
That leads to odd results, for instance let's try to find `pattern:[𝒳𝒴]` in the string `subject:𝒳`:
|
||||
|
||||
```js run
|
||||
alert( '𝒳'.match(/[𝒳𝒴]/) ); // odd result
|
||||
```
|
||||
|
||||
The result would be wrong, because by default the regexp engine does not understand surrogate pairs. It thinks that `[𝒳𝒴]` are not two, but four characters: the left half of `𝒳` `(1)`, the right half of `𝒳` `(2)`, the left half of `𝒴` `(3)`, the right half of `𝒴` `(4)`.
|
||||
|
||||
So it finds the left half of `𝒳` in the string `𝒳`, not the whole symbol.
|
||||
|
||||
In other words, the search works like `'12'.match(/[1234]/)` -- the `1` is returned (left half of `𝒳`).
|
||||
|
||||
The `/.../u` flag fixes that. It enables surrogate pairs in the regexp engine, so the result is correct:
|
||||
|
||||
```js run
|
||||
alert( '𝒳'.match(/[𝒳𝒴]/u) ); // 𝒳
|
||||
```
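
To see what exactly each variant returns, we can look at the length of the match. In this sketch the characters are written with escapes: four separate halves for the regexp without the flag, and two whole code points for the one with `u`. The variable names are assumptions made for the example.

```js
const scriptX = "\u{1D4B3}"; // the mathematical X

// Without the u flag the set consists of four separate halves
const halves = /[\uD835\uDCB3\uD835\uDCB4]/;
const broken = scriptX.match(halves);
console.log(broken[0].length);                     // 1, only one half
console.log(broken[0].charCodeAt(0).toString(16)); // d835, the left half

// With the u flag the set consists of two whole characters
const whole = /[\u{1D4B3}\u{1D4B4}]/u;
const fixed = scriptX.match(whole);
console.log(fixed[0].length);      // 2, both halves together
console.log(fixed[0] === scriptX); // true
```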
|
||||
|
||||
There's an error that may happen if we forget the flag:
|
||||
|
||||
```js run
|
||||
'𝒳'.match(/[𝒳-𝒴]/); // SyntaxError: invalid range in character class
|
||||
```
|
||||
|
||||
Here the regexp `[𝒳-𝒴]` is treated as `[12-34]` (where `2` is the right part of `𝒳` and `3` is the left part of `𝒴`), and the range between two halves `2` and `3` is unacceptable.
|
||||
|
||||
Using the flag would make it work right:
|
||||
|
||||
```js run
|
||||
alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴
|
||||
```
|
||||
|
||||
Finally, let's note that if we do not deal with surrogate pairs, then the flag does nothing for us. But in the modern world we often meet them.
|
|
@ -0,0 +1,9 @@
|
|||
|
||||
Solution:
|
||||
|
||||
```js run
|
||||
let reg = /\.{3,}/g;
|
||||
alert( "Hello!... How goes?.....".match(reg) ); // ..., .....
|
||||
```
|
||||
|
||||
Please note that the dot is a special character, so we have to escape it and insert as `\.`.
|
|
@ -0,0 +1,14 @@
|
|||
importance: 5
|
||||
|
||||
---
|
||||
|
||||
# How to find an ellipsis "..." ?
|
||||
|
||||
Create a regexp to find ellipsis: 3 (or more?) dots in a row.
|
||||
|
||||
Check it:
|
||||
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
alert( "Hello!... How goes?.....".match(reg) ); // ..., .....
|
||||
```
|
|
@ -0,0 +1,31 @@
|
|||
We need to look for `#` followed by 6 hexadecimal characters.
|
||||
|
||||
A hexadecimal character can be described as `pattern:[0-9a-fA-F]`. Or if we use the `i` flag, then just `pattern:[0-9a-f]`.
|
||||
|
||||
Then we can look for 6 of them using the quantifier `pattern:{6}`.
|
||||
|
||||
As a result, we have the regexp: `pattern:/#[a-f0-9]{6}/gi`.
|
||||
|
||||
```js run
|
||||
let reg = /#[a-f0-9]{6}/gi;
|
||||
|
||||
let str = "color:#121212; background-color:#AA00ef bad-colors:f#fddee #fd2"
|
||||
|
||||
alert( str.match(reg) ); // #121212,#AA00ef
|
||||
```
|
||||
|
||||
The problem is that it finds the color in longer sequences:
|
||||
|
||||
```js run
|
||||
alert( "#12345678".match( /#[a-f0-9]{6}/gi ) ) // #123456
|
||||
```
|
||||
|
||||
To fix that, we can add `pattern:\b` to the end:
|
||||
|
||||
```js run
|
||||
// color
|
||||
alert( "#123456".match( /#[a-f0-9]{6}\b/gi ) ); // #123456
|
||||
|
||||
// not a color
|
||||
alert( "#12345678".match( /#[a-f0-9]{6}\b/gi ) ); // null
|
||||
```
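
As a final check, the sketch below runs the regexp with `pattern:\b` against the full string from the task, including the too-long `#12345678`. The variable names are made up for the illustration.

```js
const colorReg = /#[a-f0-9]{6}\b/gi;

const colorStr = "color:#121212; background-color:#AA00ef bad-colors:f#fddee #fd2 #12345678";

// Only the two 6-digit colors pass, the longer sequence is rejected by \b
const colors = colorStr.match(colorReg);
console.log(colors.join(",")); // #121212,#AA00ef
```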
|
|
@ -0,0 +1,15 @@
|
|||
# Regexp for HTML colors
|
||||
|
||||
Create a regexp to search HTML-colors written as `#ABCDEF`: first `#` and then 6 hexadecimal characters.
|
||||
|
||||
An example of use:
|
||||
|
||||
```js
|
||||
let reg = /...your regexp.../
|
||||
|
||||
let str = "color:#121212; background-color:#AA00ef bad-colors:f#fddee #fd2 #12345678";
|
||||
|
||||
alert( str.match(reg) ) // #121212,#AA00ef
|
||||
```
|
||||
|
||||
P.S. In this task we do not need other color formats like `#123` or `rgb(1,2,3)` etc.
|
133
5-regular-expressions/07-regexp-quantifiers/article.md
Normal file
|
@ -0,0 +1,133 @@
|
|||
# Quantifiers +, *, ? and {n}
|
||||
|
||||
Let's say we have a string like `+7(903)-123-45-67` and want to find all numbers in it. But unlike before, we are interested not in digits, but in full numbers: `7, 903, 123, 45, 67`.
|
||||
|
||||
A number is a sequence of 1 or more digits `\d`. To say how many we need, we use a *quantifier*.
|
||||
|
||||
## Quantity {n}
|
||||
|
||||
The most obvious quantifier is a number in curly braces: `pattern:{n}`. A quantifier is put after a character (or a character class and so on) and specifies exactly how many we need.
|
||||
|
||||
It also has advanced forms, here we go with examples:
|
||||
|
||||
Exact count: `{5}`
|
||||
: `pattern:\d{5}` denotes exactly 5 digits, the same as `pattern:\d\d\d\d\d`.
|
||||
|
||||
The example below looks for a 5-digit number:
|
||||
|
||||
```js run
|
||||
alert( "I'm 12345 years old".match(/\d{5}/) ); // "12345"
|
||||
```
|
||||
|
||||
We can add `\b` to exclude longer numbers: `pattern:\b\d{5}\b`.
|
||||
|
||||
The count from-to: `{3,5}`
|
||||
: To find numbers from 3 to 5 digits we can put the limits into curly braces: `pattern:\d{3,5}`
|
||||
|
||||
```js run
|
||||
alert( "I'm not 12, but 1234 years old".match(/\d{3,5}/) ); // "1234"
|
||||
```
|
||||
|
||||
We can omit the upper limit. Then a regexp `pattern:\d{3,}` looks for numbers of `3` and more digits:
|
||||
|
||||
```js run
|
||||
alert( "I'm not 12, but 345678 years old".match(/\d{3,}/) ); // "345678"
|
||||
```
|
||||
|
||||
In the case of the string `+7(903)-123-45-67` we need numbers: one or more digits in a row. That is `pattern:\d{1,}`:
|
||||
|
||||
```js run
|
||||
let str = "+7(903)-123-45-67";
|
||||
|
||||
let numbers = str.match(/\d{1,}/g);
|
||||
|
||||
alert(numbers); // 7,903,123,45,67
|
||||
```
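
The three forms can be compared on one string. In the sketch below the sample string and the variable names are assumptions made for the example; `pattern:\b` is added so that parts of longer numbers are not counted, and the last line shows what happens without it.

```js
const ages = "12 345 6789 123456";

const exactlyThree = ages.match(/\b\d{3}\b/g);
console.log(exactlyThree.join(", ")); // 345

const threeToFour = ages.match(/\b\d{3,4}\b/g);
console.log(threeToFour.join(", ")); // 345, 6789

const threeOrMore = ages.match(/\b\d{3,}\b/g);
console.log(threeOrMore.join(", ")); // 345, 6789, 123456

// Without \b the regexp also cuts pieces out of longer numbers
const noBoundary = ages.match(/\d{3}/g);
console.log(noBoundary.join(", ")); // 345, 678, 123, 456
```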
|
||||
|
||||
## Shorthands
|
||||
|
||||
The most frequently needed quantifiers have shorthands:
|
||||
|
||||
`+`
|
||||
: Means "one or more", the same as `{1,}`.
|
||||
|
||||
For instance, `pattern:\d+` looks for numbers:
|
||||
|
||||
```js run
|
||||
let str = "+7(903)-123-45-67";
|
||||
|
||||
alert( str.match(/\d+/g) ); // 7,903,123,45,67
|
||||
```
|
||||
|
||||
`?`
|
||||
: Means "zero or one", the same as `{0,1}`. In other words, it makes the symbol optional.
|
||||
|
||||
For instance, the pattern `pattern:ou?r` looks for `match:o` followed by zero or one `match:u`, and then `match:r`.
|
||||
|
||||
So it can find `match:or` in the word `subject:color` and `match:our` in `subject:colour`:
|
||||
|
||||
```js run
|
||||
let str = "Should I write color or colour?";
|
||||
|
||||
alert( str.match(/colou?r/g) ); // color, colour
|
||||
```
|
||||
|
||||
`*`
|
||||
: Means "zero or more", the same as `{0,}`. That is, the character may repeat any number of times or be absent.
|
||||
|
||||
The example below looks for a digit followed by any number of zeroes:
|
||||
|
||||
```js run
|
||||
alert( "100 10 1".match(/\d0*/g) ); // 100, 10, 1
|
||||
```
|
||||
|
||||
Compare it with `'+'` (one or more):
|
||||
|
||||
```js run
|
||||
alert( "100 10 1".match(/\d0+/g) ); // 100, 10
|
||||
```
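
That each shorthand is the same as its `{n,m}` form can be checked with a tiny sketch. Please note that `sameMatches` is a hypothetical helper written for this example, it is not a part of the language.

```js
// Hypothetical helper: do two patterns find the same matches in a string?
function sameMatches(str, first, second) {
  // String(...) turns both results (an array or null) into comparable text
  return String(str.match(first)) === String(str.match(second));
}

const plusSame = sameMatches("+7(903)-123-45-67", /\d+/g, /\d{1,}/g);
const questionSame = sameMatches("color or colour", /colou?r/g, /colou{0,1}r/g);
const starSame = sameMatches("100 10 1", /\d0*/g, /\d0{0,}/g);

console.log(plusSame, questionSame, starSame); // true true true
```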
|
||||
|
||||
## More examples
|
||||
|
||||
Quantifiers are used very often. They are one of the main "building blocks" for complex regular expressions, so let's see more examples.
|
||||
|
||||
Regexp "decimal fraction" (a number with a floating point): `pattern:\d+\.\d+`
|
||||
: In action:
|
||||
```js run
|
||||
alert( "0 1 12.345 7890".match(/\d+\.\d+/g) ); // 12.345
|
||||
```
|
||||
|
||||
Regexp "open HTML-tag without attributes", like `<span>` or `<p>`: `pattern:/<[a-z]+>/i`
|
||||
: In action:
|
||||
|
||||
```js run
|
||||
alert( "<body> ... </body>".match(/<[a-z]+>/gi) ); // <body>
|
||||
```
|
||||
|
||||
We look for character `pattern:'<'` followed by one or more English letters, and then `pattern:'>'`.
|
||||
|
||||
Regexp "open HTML-tag without attributes" (improved): `pattern:/<[a-z][a-z0-9]*>/i`
|
||||
: Better regexp: according to the standard, HTML tag name may have a digit at any position except the first one, like `<h1>`.
|
||||
|
||||
```js run
|
||||
alert( "<h1>Hi!</h1>".match(/<[a-z][a-z0-9]*>/gi) ); // <h1>
|
||||
```
|
||||
|
||||
Regexp "opening or closing HTML-tag without attributes": `pattern:/<\/?[a-z][a-z0-9]*>/i`
|
||||
: We added an optional slash `pattern:/?` before the tag. We had to escape it with a backslash, otherwise JavaScript would think it is the pattern end.
|
||||
|
||||
```js run
|
||||
alert( "<h1>Hi!</h1>".match(/<\/?[a-z][a-z0-9]*>/gi) ); // <h1>, </h1>
|
||||
```
|
||||
|
||||
```smart header="More precise means more complex"
|
||||
We can see one common rule in these examples: the more precise the regular expression is, the longer and more complex it becomes.
|
||||
|
||||
For instance, HTML tags could use a simpler regexp: `pattern:<\w+>`.
|
||||
|
||||
Because `pattern:\w` means any English letter or a digit or `'_'`, the regexp also matches non-tags, for instance `match:<_>`. But it's much simpler than `pattern:<[a-z][a-z0-9]*>`.
|
||||
|
||||
Are we ok with `pattern:<\w+>` or do we need `pattern:<[a-z][a-z0-9]*>`?
|
||||
|
||||
In real life both variants are acceptable. Depends on how tolerant we can be to "extra" matches and whether it's difficult or not to filter them out by other means.
|
||||
```
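
The trade-off is easy to see when both regexps run on the same text. The sample string below is made up for the illustration: it contains two real tag names and two things that only look like tags.

```js
const html = "<_> <h1> <1a> <p>";

// The simple regexp accepts everything made of \w characters
const loose = html.match(/<\w+>/g);
console.log(loose.join(", ")); // <_>, <h1>, <1a>, <p>

// The precise one requires a letter first, then letters or digits
const strict = html.match(/<[a-z][a-z0-9]*>/gi);
console.log(strict.join(", ")); // <h1>, <p>
```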
|
|
@ -0,0 +1,6 @@
|
|||
|
||||
The result is: `match:123 4`.
|
||||
|
||||
First the lazy `pattern:\d+?` tries to take as few digits as it can, but it has to reach the space, so it takes `match:123`.
|
||||
|
||||
Then the second `\d+?` takes only one digit, because that's enough.
|
|
@ -0,0 +1,7 @@
|
|||
# A match for /d+? d+?/
|
||||
|
||||
What's the match here?
|
||||
|
||||
```js
|
||||
alert( "123 456".match(/\d+? \d+?/g) ); // ?
|
||||
```
|
|
@ -0,0 +1,17 @@
|
|||
We need to find the beginning of the comment `match:<!--`, then everything till the end of `match:-->`.
|
||||
|
||||
The first idea could be `pattern:<!--.*?-->` -- the lazy quantifier makes the dot stop right before `match:-->`.
|
||||
|
||||
But a dot in JavaScript means "any symbol except the newline". So multiline comments won't be found.
|
||||
|
||||
We can use `pattern:[\s\S]` instead of the dot to match "anything":
|
||||
|
||||
```js run
|
||||
let reg = /<!--[\s\S]*?-->/g;
|
||||
|
||||
let str = `... <!-- My -- comment
|
||||
test --> .. <!----> ..
|
||||
`;
|
||||
|
||||
alert( str.match(reg) ); // '<!-- My -- comment \n test -->', '<!---->'
|
||||
```
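
In engines that support the `s` flag (added to the language in ES2018), the dot matches a newline as well, so the first idea can work too. The sketch below is only an alternative under that assumption, and the variable names are made up for the example.

```js
// The s flag makes the dot match newlines too (requires ES2018 support)
const commentReg = /<!--.*?-->/gs;

const page = `... <!-- My -- comment
 test --> .. <!----> ..
`;

const comments = page.match(commentReg);
console.log(comments.length);          // 2
console.log(comments[1]);              // <!---->
console.log(comments[0].includes("\n")); // true, the comment spans two lines
```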
|
|
@ -0,0 +1,13 @@
|
|||
# Find HTML comments
|
||||
|
||||
Find all HTML comments in the text:
|
||||
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
|
||||
let str = `... <!-- My -- comment
|
||||
test --> .. <!----> ..
|
||||
`;
|
||||
|
||||
alert( str.match(reg) ); // '<!-- My -- comment \n test -->', '<!---->'
|
||||
```
|
|
@ -0,0 +1,10 @@
|
|||
|
||||
The solution is `pattern:<[^<>]+>`.
|
||||
|
||||
```js run
|
||||
let reg = /<[^<>]+>/g;
|
||||
|
||||
let str = '<> <a href="/"> <input type="radio" checked> <b>';
|
||||
|
||||
alert( str.match(reg) ); // '<a href="/">', '<input type="radio" checked>', '<b>'
|
||||
```
|
|
@ -0,0 +1,15 @@
|
|||
# Find HTML tags
|
||||
|
||||
Create a regular expression to find all (opening and closing) HTML tags with their attributes.
|
||||
|
||||
An example of use:
|
||||
|
||||
```js run
|
||||
let reg = /your regexp/g;
|
||||
|
||||
let str = '<> <a href="/"> <input type="radio" checked> <b>';
|
||||
|
||||
alert( str.match(reg) ); // '<a href="/">', '<input type="radio" checked>', '<b>'
|
||||
```
|
||||
|
||||
Let's assume that tags may not contain `<` and `>` inside (in quotes too); that simplifies things a bit.
|
308
5-regular-expressions/08-regexp-greedy-and-lazy/article.md
Normal file
|
@ -0,0 +1,308 @@
|
|||
# Greedy and lazy quantifiers
|
||||
|
||||
Quantifiers are very simple at first sight, but in fact they can be tricky.
|
||||
|
||||
We should understand how the search works very well if we plan to look for something more complex than `pattern:/\d+/`.
|
||||
|
||||
[cut]
|
||||
|
||||
Let's take the following task as an example.
|
||||
|
||||
We have a text and need to replace all quotes `"..."` with guillemet marks: `«...»`. They are preferred for typography in many countries.
|
||||
|
||||
For instance: `"Hello, world"` should become `«Hello, world»`.
|
||||
|
||||
Some countries prefer `„Witam, świat!”` (Polish) or even `「你好,世界」` (Chinese) quotes. For different locales we can choose different replacements, but that all works the same, so let's start with `«...»`.
|
||||
|
||||
To make replacements we first need to find all quoted substrings.
|
||||
|
||||
The regular expression can look like this: `pattern:/".+"/g`. That is: we look for a quote followed by one or more characters, and then another quote.
|
||||
|
||||
...But if we try to apply it, even in such a simple case...
|
||||
|
||||
```js run
|
||||
let reg = /".+"/g;
|
||||
|
||||
let str = 'a "witch" and her "broom" is one';
|
||||
|
||||
alert( str.match(reg) ); // "witch" and her "broom"
|
||||
```
|
||||
|
||||
...We can see that it doesn't work as intended!
|
||||
|
||||
Instead of finding two matches `match:"witch"` and `match:"broom"`, it finds one: `match:"witch" and her "broom"`.
|
||||
|
||||
That can be described as "greediness is the cause of all evil".
|
||||
|
||||
## Greedy search
|
||||
|
||||
To find a match, the regular expression engine uses the following algorithm:
|
||||
|
||||
- For every position in the string
|
||||
- Match the pattern symbol-by-symbol using classes and quantifiers.
|
||||
- If there's no match, go to the next position.
|
||||
|
||||
This general description does not make it obvious why the regexp fails, so let's elaborate on how the search works for the pattern `pattern:".+"`.
|
||||
|
||||
1. The first pattern character is a quote `pattern:"`.
|
||||
|
||||
The regular expression engine tries to find it at the 0th position of the source string, but there's `subject:a` there, so no match.
|
||||
|
||||
Then it advances: goes to the 1st, 2nd positions in the source string and tries to find the pattern there, and finally finds the quote at the 3rd position:
|
||||
|
||||

|
||||
|
||||
2. The quote is detected, and then the engine tries to find a match for the rest of the pattern.
|
||||
|
||||
In our case the next pattern character is `pattern:.` (a dot). It denotes "any character except a newline", so the next string letter `match:'w'` fits:
|
||||
|
||||

|
||||
|
||||
3. Then the dot repeats because of the quantifier `pattern:.+`. The regular expression engine builds the match by taking characters one by one while it is possible.
|
||||
|
||||
...When does it become impossible? All characters match the dot, so it only stops when it reaches the end of the string:
|
||||
|
||||

|
||||
|
||||
4. Now the engine finished repeating for `pattern:.+` and tries to find the next character of the pattern. It's the quote `pattern:"`. But there's a problem: the string has finished, there are no more characters!
|
||||
|
||||
The regular expression engine understands that it took too many `pattern:.+` and starts to *backtrack*.
|
||||
|
||||
In other words, it shortens the match for the quantifier by one character:
|
||||
|
||||

|
||||
|
||||
Now it assumes that `pattern:.+` ends one character before the end and tries to match the rest of the pattern from that position.
|
||||
|
||||
If there were a quote there, then that would be the end, but the last character is `subject:'e'`, so there's no match.
|
||||
|
||||
5. ...So the engine decreases the number of repetitions of `pattern:.+` by one more character:
|
||||
|
||||

|
||||
|
||||
The quote `pattern:'"'` does not match `subject:'n'`.
|
||||
|
||||
6. The engine keeps backtracking: it decreases the count of repetitions for `pattern:'.'` until the rest of the pattern (in our case `pattern:'"'`) matches:
|
||||
|
||||

|
||||
|
||||
7. The match is complete.
|
||||
|
||||
8. So the first match is `match:"witch" and her "broom"`. The further search starts where the first match ends, but there are no more quotes in the rest of the string `subject:is one`, so no more results.
|
||||
|
||||
That's probably not what we expected, but that's how it works.
|
||||
|
||||
**In the greedy mode (by default) the quantifier is repeated as many times as possible.**
|
||||
|
||||
The regexp engine tries to fetch as many characters as it can by `pattern:.+`, and then shortens that one by one.
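
The steps above can be modelled with a few lines of code. This is a simplified sketch written for this one pattern only, not the real engine: `greedyQuoted` is a hypothetical function, and it assumes a single-line string, so the dot can match every character.

```js
// Simplified model of matching ".+" (a quote, one or more characters, a quote).
// Assumes there are no newlines in str, so the dot matches any character.
function greedyQuoted(str) {
  for (let start = 0; start < str.length; start++) {
    if (str[start] !== '"') continue; // step 1: look for the opening quote

    // Steps 2-3: .+ takes everything up to the end of the string.
    // Steps 4-6: then it gives characters back one by one (backtracking),
    // and after each step we check for the closing quote at position `end`.
    // The limit start + 2 leaves at least one character for .+
    for (let end = str.length - 1; end >= start + 2; end--) {
      if (str[end] === '"') return str.slice(start, end + 1); // step 7
    }
  }
  return null; // no match at any position
}

const witchStr = 'a "witch" and her "broom" is one';
const modelled = greedyQuoted(witchStr);
console.log(modelled); // "witch" and her "broom"

// The real regexp gives the same result
console.log(modelled === witchStr.match(/".+"/)[0]); // true
```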
|
||||
|
||||
For our task we want another thing. That's what the lazy quantifier mode is for.
|
||||
|
||||
## Lazy mode
|
||||
|
||||
The lazy mode of a quantifier is the opposite of the greedy mode. It means: "repeat a minimal number of times".
|
||||
|
||||
We can enable it by putting a question mark `pattern:'?'` after the quantifier, so that it becomes `pattern:*?` or `pattern:+?` or even `pattern:??` for `pattern:'?'`.
|
||||
|
||||
To make things clear: usually a question mark `pattern:?` is a quantifier by itself (zero or one), but if added *after another quantifier (or even itself)* it gets another meaning -- it switches the matching mode from greedy to lazy.
|
||||
|
||||
The regexp `pattern:/".+?"/g` works as intended: it finds `match:"witch"` and `match:"broom"`:
|
||||
|
||||
```js run
|
||||
let reg = /".+?"/g;
|
||||
|
||||
let str = 'a "witch" and her "broom" is one';
|
||||
|
||||
alert( str.match(reg) ); // "witch", "broom"
|
||||
```
|
||||
|
||||
To clearly understand the change, let's trace the search step by step.
|
||||
|
||||
1. The first step is the same: it finds the pattern start `pattern:'"'` at the 3rd position:
|
||||
|
||||

|
||||
|
||||
2. The next step is also similar: the engine finds a match for the dot `pattern:'.'`:
|
||||
|
||||

|
||||
|
||||
3. And now the search goes differently. Because we have a lazy mode for `pattern:+?`, the engine doesn't try to match a dot one more time, but stops and tries to match the rest of the pattern `pattern:'"'` right now:
|
||||
|
||||

|
||||
|
||||
If there were a quote there, then the search would end, but there's `'i'`, so there's no match.
|
||||
4. Then the regular expression engine increases the number of repetitions for the dot and tries one more time:
|
||||
|
||||

|
||||
|
||||
Failure again. Then the number of repetitions is increased again and again...
|
||||
5. ...Till the match for the rest of the pattern is found:
|
||||
|
||||

|
||||
|
||||
6. The next search starts from the end of the current match and yields one more result:
|
||||
|
||||

|
||||
|
||||
In this example we saw how the lazy mode works for `pattern:+?`. Quantifiers `pattern:*?` and `pattern:??` work in a similar way -- the regexp engine increases the number of repetitions only if the rest of the pattern can't match at the given position.
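
The lazy search can be modelled in code as well. Again, this is a simplified sketch for the single pattern `pattern:".+?"` and not the real engine: `lazyQuotedAll` is a hypothetical function, and it assumes that the string has no newlines.

```js
// Simplified model of matching ".+?" with the g flag.
// Assumes there are no newlines in str, so the dot matches any character.
function lazyQuotedAll(str) {
  const results = [];
  let start = 0;

  while (start < str.length) {
    if (str[start] !== '"') { // step 1: look for the opening quote
      start++;
      continue;
    }

    // Steps 2-3: .+? takes one character, then we check for the closing quote
    let end = start + 2;
    // Steps 4-5: no quote yet, so add one more repetition and check again
    while (end < str.length && str[end] !== '"') end++;

    if (end >= str.length) { // no closing quote for this opening one
      start++;
      continue;
    }

    results.push(str.slice(start, end + 1));
    start = end + 1; // step 6: the next search starts after the match
  }

  return results;
}

const lazyStr = 'a "witch" and her "broom" is one';
const lazyModelled = lazyQuotedAll(lazyStr);
console.log(lazyModelled.join(", ")); // "witch", "broom"

// The real regexp gives the same result
console.log(String(lazyModelled) === String(lazyStr.match(/".+?"/g))); // true
```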
|
||||
|
||||
**Laziness is only enabled for the quantifier with `?`.**
|
||||
|
||||
Other quantifiers remain greedy.
|
||||
|
||||
For instance:
|
||||
|
||||
```js run
|
||||
alert( "123 456".match(/\d+ \d+?/g) ); // 123 4
|
||||
```
|
||||
|
||||
1. The pattern `pattern:\d+` tries to match as many digits as it can (greedy mode), so it finds `match:123` and stops, because the next character is a space `pattern:' '`.
|
||||
2. Then there's a space in pattern, it matches.
|
||||
3. Then there's `pattern:\d+?`. The quantifier is in lazy mode, so it finds one digit `match:4` and tries to check if the rest of the pattern matches from there.
|
||||
|
||||
...But there's nothing in the pattern after `pattern:\d+?`.
|
||||
|
||||
The lazy mode doesn't repeat anything without a need. The pattern finished, so we're done. We have a match `match:123 4`.
|
||||
4. The next search starts from the character `5`.
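
A lazy quantifier does repeat more when the rest of the pattern demands it. In the sketch below we add a word boundary `pattern:\b` after the lazy part, so one digit is no longer enough. The variable names are made up for the illustration.

```js
// Nothing follows \d+? in the pattern, so one digit is enough
const lazyAlone = "123 456".match(/\d+ \d+?/);
console.log(lazyAlone[0]); // 123 4

// \b can't match between "4" and "5", so \d+? has to take more digits
const lazyForced = "123 456".match(/\d+ \d+?\b/);
console.log(lazyForced[0]); // 123 456
```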
|
||||
|
||||
```smart header="Optimizations"
|
||||
Modern regular expression engines can optimize internal algorithms to work faster. So they may work a bit different from the described algorithm.
|
||||
|
||||
But to understand how regular expressions work and to build regular expressions, we don't need to know about that. They are only used internally to optimize things.
|
||||
|
||||
Complex regular expressions are hard to optimize, so the search may work exactly as described as well.
|
||||
```
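
Returning to the original task, here is a sketch of how the replacement itself might look. It relies on `str.replace` with parentheses around the lazy part: in the replacement string, `$1` stands for what the parentheses matched. Parentheses are not covered in this chapter, so please treat this only as an illustration of where the lazy quantifier fits.

```js
const quoted = 'a "witch" and her "broom" is one';

// "(.+?)" finds each quoted substring, $1 is the text between the quotes
const replaced = quoted.replace(/"(.+?)"/g, "«$1»");

console.log(replaced); // a «witch» and her «broom» is one
```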
|
||||
|
||||
## Alternative approach
|
||||
|
||||
With regexps, there's often more than one way to do the same thing.
|
||||
|
||||
In our case we can find quoted strings without lazy mode using the regexp `pattern:"[^"]+"`:
|
||||
|
||||
```js run
|
||||
let reg = /"[^"]+"/g;
|
||||
|
||||
let str = 'a "witch" and her "broom" is one';
|
||||
|
||||
alert( str.match(reg) ); // "witch", "broom"
|
||||
```
|
||||
|
||||
The regexp `pattern:"[^"]+"` gives correct results, because it looks for a quote `pattern:'"'` followed by one or more non-quotes `pattern:[^"]`, and then the closing quote.
|
||||
|
||||
When the regexp engine looks for `pattern:[^"]+` it stops the repetitions when it meets the closing quote, and we're done.
|
||||
|
||||
Please note that this logic does not replace lazy quantifiers!
|
||||
|
||||
It is just different. There are times when we need one or another.
|
||||
|
||||
Let's see one more example where lazy quantifiers fail and this variant works right.
|
||||
|
||||
For instance, we want to find links of the form `<a href="..." class="doc">`, with any `href`.
|
||||
|
||||
Which regular expression to use?
|
||||
|
||||
The first idea might be: `pattern:/<a href=".*" class="doc">/g`.
|
||||
|
||||
Let's check it:
|
||||
```js run
|
||||
let str = '...<a href="link" class="doc">...';
|
||||
let reg = /<a href=".*" class="doc">/g;
|
||||
|
||||
// Works!
|
||||
alert( str.match(reg) ); // <a href="link" class="doc">
|
||||
```
|
||||
|
||||
...But what if there are many links in the text?
|
||||
|
||||
```js run
|
||||
let str = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';
|
||||
let reg = /<a href=".*" class="doc">/g;
|
||||
|
||||
// Whoops! Two links in one match!
|
||||
alert( str.match(reg) ); // <a href="link1" class="doc">... <a href="link2" class="doc">
|
||||
```
|
||||
|
||||
Now the result is wrong for the same reason as our "witches" example. The quantifier `pattern:.*` took too many characters.
|
||||
|
||||
The match looks like this:
|
||||
|
||||
```html
|
||||
<a href="....................................." class="doc">
|
||||
<a href="link1" class="doc">... <a href="link2" class="doc">
|
||||
```
|
||||
|
||||
Let's modify the pattern by making the quantifier `pattern:.*?` lazy:
|
||||
|
||||
```js run
|
||||
let str = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';
|
||||
let reg = /<a href=".*?" class="doc">/g;
|
||||
|
||||
// Works!
|
||||
alert( str.match(reg) ); // <a href="link1" class="doc">, <a href="link2" class="doc">
|
||||
```
|
||||
|
||||
Now it works, there are two matches:
|
||||
|
||||
```html
|
||||
<a href="....." class="doc"> <a href="....." class="doc">
|
||||
<a href="link1" class="doc">... <a href="link2" class="doc">
|
||||
```
|
||||
|
||||
Why it works should be obvious after the explanations above. So let's not dwell on the details, but try one more text:
|
||||
|
||||
```js run
|
||||
let str = '...<a href="link1" class="wrong">... <p style="" class="doc">...';
|
||||
let reg = /<a href=".*?" class="doc">/g;
|
||||
|
||||
// Wrong match!
|
||||
alert( str.match(reg) ); // <a href="link1" class="wrong">... <p style="" class="doc">
|
||||
```
|
||||
|
||||
We can see that the regexp matched not just a link, but also a lot of text after it, including `<p...>`.
|
||||
|
||||
Why does that happen?
|
||||
|
||||
1. First the regexp finds a link start `match:<a href="`.
|
||||
|
||||
2. Then it looks for `pattern:.*?`: it takes one character, checks whether the rest of the pattern matches, then takes another one, and so on.
|
||||
|
||||
The quantifier `pattern:.*?` consumes characters until it meets `match:class="doc">`.
|
||||
|
||||
...And where can it find it? If we look at the text, we can see that the only `match:class="doc">` is beyond the link, in the tag `<p>`.
|
||||
|
||||
3. So we have a match:
|
||||
|
||||
```html
|
||||
<a href="..................................." class="doc">
|
||||
<a href="link1" class="wrong">... <p style="" class="doc">
|
||||
```
|
||||
|
||||
So laziness did not work for us here.
|
||||
|
||||
We need the pattern to look for `<a href="...something..." class="doc">`, but both greedy and lazy variants have problems.
|
||||
|
||||
The correct variant would be: `pattern:href="[^"]*"`. It will take all characters inside the `href` attribute till the nearest quote, just what we need.
|
||||
|
||||
A working example:
|
||||
|
||||
```js run
|
||||
let str1 = '...<a href="link1" class="wrong">... <p style="" class="doc">...';
|
||||
let str2 = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';
|
||||
let reg = /<a href="[^"]*" class="doc">/g;
|
||||
|
||||
// Works!
|
||||
alert( str1.match(reg) ); // null, no matches, that's correct
|
||||
alert( str2.match(reg) ); // <a href="link1" class="doc">, <a href="link2" class="doc">
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
Quantifiers have two modes of work:
|
||||
|
||||
Greedy
|
||||
: By default the regular expression engine tries to repeat the quantifier as many times as possible. For instance, `pattern:\d+` consumes all possible digits. When it becomes impossible to consume more (no more digits or string end), then it continues to match the rest of the pattern. If there's no match then it decreases the number of repetitions (backtracks) and tries again.
|
||||
|
||||
Lazy
|
||||
: Enabled by the question mark `pattern:?` after the quantifier. The regexp engine tries to match the rest of the pattern before each repetition of the quantifier.
|
||||
|
||||
As we've seen, the lazy mode is not a "panacea" for the problems of the greedy search. An alternative is a "fine-tuned" greedy search, with exclusions. Soon we'll see more examples of it.
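
The two modes can be compared on a tiny input. This is only an illustrative sketch; the string and the variable names are assumptions chosen for the demonstration, and `console.log` stands in for `alert`.

```js
let str = "123 456";

// Greedy: \d+ takes as many digits as it can
let greedy = str.match(/\d+/)[0];

// Lazy: \d+? stops as soon as the rest of the pattern (here: nothing) matches
let lazy = str.match(/\d+?/)[0];

// Lazy with a continuation: \d+? expands only until " 4" can match
let lazyThenRest = str.match(/\d+? 4/)[0];

console.log(greedy);       // 123
console.log(lazy);         // 1
console.log(lazyThenRest); // 123 4
```

The last line shows that a lazy quantifier still takes more characters when that is the only way for the rest of the pattern to match.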
|
|
@ -0,0 +1,29 @@
|
|||
A regexp to search 3-digit color `#abc`: `pattern:/#[a-f0-9]{3}/i`.
|
||||
|
||||
We can add exactly 3 more optional hex digits. We don't need more or less. Either we have them or we don't.
|
||||
|
||||
The simplest way to add them is to append them to the regexp: `pattern:/#[a-f0-9]{3}([a-f0-9]{3})?/i`
|
||||
|
||||
We can do it in a smarter way though: `pattern:/#([a-f0-9]{3}){1,2}/i`.
|
||||
|
||||
Here the regexp `pattern:[a-f0-9]{3}` is in parentheses to apply the quantifier `pattern:{1,2}` to it as a whole.
|
||||
|
||||
In action:
|
||||
|
||||
```js run
|
||||
let reg = /#([a-f0-9]{3}){1,2}/gi;
|
||||
|
||||
let str = "color: #3f3; background-color: #AA00ef; and: #abcd";
|
||||
|
||||
alert( str.match(reg) ); // #3f3 #AA00ef #abc
|
||||
```
|
||||
|
||||
There's a minor problem here: the pattern found `match:#abc` in `subject:#abcd`. To prevent that we can add `pattern:\b` to the end:
|
||||
|
||||
```js run
|
||||
let reg = /#([a-f0-9]{3}){1,2}\b/gi;
|
||||
|
||||
let str = "color: #3f3; background-color: #AA00ef; and: #abcd";
|
||||
|
||||
alert( str.match(reg) ); // #3f3 #AA00ef
|
||||
```
|
|
@ -0,0 +1,14 @@
|
|||
# Find color in the format #abc or #abcdef
|
||||
|
||||
Write a regexp that matches colors in the format `#abc` or `#abcdef`. That is: `#` followed by 3 or 6 hexadecimal digits.
|
||||
|
||||
Usage example:
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
|
||||
let str = "color: #3f3; background-color: #AA00ef; and: #abcd";
|
||||
|
||||
alert( str.match(reg) ); // #3f3 #AA00ef
|
||||
```
|
||||
|
||||
P.S. Should be exactly 3 or 6 hex digits: values like `#abcd` should not match.
|
|
@ -0,0 +1,16 @@
|
|||
|
||||
An integer number is `pattern:\d+`.
|
||||
|
||||
A decimal part is: `pattern:\.\d+`.
|
||||
|
||||
Because the decimal part is optional, let's put it in parentheses with the quantifier `pattern:'?'`.
|
||||
|
||||
Finally we have the regexp: `pattern:\d+(\.\d+)?`:
|
||||
|
||||
```js run
|
||||
let reg = /\d+(\.\d+)?/g;
|
||||
|
||||
let str = "1.5 0 12. 123.4.";
|
||||
|
||||
alert( str.match(reg) ); // 1.5, 0, 12, 123.4
|
||||
```
|
|
@ -0,0 +1,12 @@
|
|||
# Find positive numbers
|
||||
|
||||
Create a regexp that looks for positive numbers, including those without a decimal point.
|
||||
|
||||
An example of use:
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
|
||||
let str = "1.5 0 12. 123.4.";
|
||||
|
||||
alert( str.match(reg) ); // 1.5, 0, 12, 123.4
|
||||
```
|
|
@ -0,0 +1,11 @@
|
|||
A positive number with an optional decimal part is (per previous task): `pattern:\d+(\.\d+)?`.
|
||||
|
||||
Let's add an optional `-` in the beginning:
|
||||
|
||||
```js run
|
||||
let reg = /-?\d+(\.\d+)?/g;
|
||||
|
||||
let str = "-1.5 0 2 -123.4.";
|
||||
|
||||
alert( str.match(reg) ); // -1.5, 0, 2, -123.4
|
||||
```
|
|
@ -0,0 +1,13 @@
|
|||
# Find all numbers
|
||||
|
||||
Write a regexp that looks for all decimal numbers, including integers, numbers with a floating point, and negative numbers.
|
||||
|
||||
An example of use:
|
||||
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
|
||||
let str = "-1.5 0 2 -123.4.";
|
||||
|
||||
alert( str.match(reg) ); // -1.5, 0, 2, -123.4
|
||||
```
|
|
@ -0,0 +1,49 @@
|
|||
A regexp for a number is: `pattern:-?\d+(\.\d+)?`. We created it in previous tasks.
|
||||
|
||||
An operator is `pattern:[-+*/]`. We put the dash `pattern:-` first, because in the middle it would mean a character range, which we don't need.
|
||||
|
||||
Note that a slash should be escaped inside a JavaScript regexp `pattern:/.../`.
|
||||
|
||||
We need a number, an operator, and then another number. And optional spaces between them.
|
||||
|
||||
The full regular expression: `pattern:-?\d+(\.\d+)?\s*[-+*/]\s*-?\d+(\.\d+)?`.
|
||||
|
||||
To get a result as an array let's put parentheses around the data that we need: numbers and the operator: `pattern:(-?\d+(\.\d+)?)\s*([-+*/])\s*(-?\d+(\.\d+)?)`.
|
||||
|
||||
In action:
|
||||
|
||||
```js run
|
||||
let reg = /(-?\d+(\.\d+)?)\s*([-+*\/])\s*(-?\d+(\.\d+)?)/;
|
||||
|
||||
alert( "1.2 + 12".match(reg) );
|
||||
```
|
||||
|
||||
The result includes:
|
||||
|
||||
- `result[0] == "1.2 + 12"` (full match)
|
||||
- `result[1] == "1.2"` (first parentheses -- the whole first number)
- `result[2] == ".2"` (second parentheses -- the decimal part `(\.\d+)?`)
|
||||
- `result[3] == "+"` (...)
|
||||
- `result[4] == "12"` (...)
|
||||
- `result[5] == undefined` (the last decimal part is absent, so it's undefined)
|
||||
|
||||
We need only numbers and the operator. We don't need decimal parts.
|
||||
|
||||
So let's remove extra groups from capturing by adding `pattern:?:`, for instance: `pattern:(?:\.\d+)?`.
|
||||
|
||||
The final solution:
|
||||
|
||||
```js run
|
||||
function parse(expr) {
|
||||
let reg = /(-?\d+(?:\.\d+)?)\s*([-+*\/])\s*(-?\d+(?:\.\d+)?)/;
|
||||
|
||||
let result = expr.match(reg);
|
||||
|
||||
if (!result) return;
|
||||
result.shift();
|
||||
|
||||
return result;
|
||||
}
|
||||
|
||||
alert( parse("-1.23 * 3.45") ); // -1.23, *, 3.45
|
||||
```
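
As a rough sketch of how the parsed parts might be used, the snippet below evaluates the expression. The function `calculate` and its choice to return `NaN` for unparsable input are assumptions made for this illustration; they are not part of the task. `parse` is repeated so that the snippet is self-contained, and `console.log` replaces `alert`.

```js
// parse() as in the solution above: returns [a, operator, b] or undefined
function parse(expr) {
  let reg = /(-?\d+(?:\.\d+)?)\s*([-+*\/])\s*(-?\d+(?:\.\d+)?)/;

  let result = expr.match(reg);

  if (!result) return;
  result.shift(); // drop the full match, keep only the groups

  return result;
}

// Hypothetical helper, not part of the task: evaluates the parsed expression
function calculate(expr) {
  let parts = parse(expr);
  if (!parts) return NaN; // assumption: unparsable input yields NaN

  let [a, op, b] = parts;
  a = Number(a); // captured groups are strings, so convert them
  b = Number(b);

  switch (op) {
    case "+": return a + b;
    case "-": return a - b;
    case "*": return a * b;
    case "/": return a / b;
  }
}

console.log( calculate("1 + 2") );     // 3
console.log( calculate(" -3 / -6 ") ); // 0.5
console.log( calculate("-2 - 2") );    // -4
```

Note that the groups come back as strings, which is why the sketch converts them with `Number` before doing arithmetic.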
|
|
@ -0,0 +1,28 @@
|
|||
# Parse an expression
|
||||
|
||||
An arithmetical expression consists of 2 numbers and an operator between them, for instance:
|
||||
|
||||
- `1 + 2`
|
||||
- `1.2 * 3.4`
|
||||
- `-3 / -6`
|
||||
- `-2 - 2`
|
||||
|
||||
The operator is one of: `"+"`, `"-"`, `"*"` or `"/"`.
|
||||
|
||||
There may be extra spaces at the beginning, at the end or between the parts.
|
||||
|
||||
Create a function `parse(expr)` that takes an expression and returns an array of 3 items:
|
||||
|
||||
1. The first number.
|
||||
2. The operator.
|
||||
3. The second number.
|
||||
|
||||
For example:
|
||||
|
||||
```js
|
||||
let [a, op, b] = parse("1.2 * 3.4");
|
||||
|
||||
alert(a); // 1.2
|
||||
alert(op); // *
|
||||
alert(b); // 3.4
|
||||
```
|
172
5-regular-expressions/09-regexp-groups/article.md
Normal file
|
@ -0,0 +1,172 @@
|
|||
# Capturing groups
|
||||
|
||||
A part of the pattern can be enclosed in parentheses `pattern:(...)`. That's called a "capturing group".
|
||||
|
||||
That has two effects:
|
||||
|
||||
1. It allows us to place a part of the match into a separate array item when using the [String#match](mdn:js/String/match) or [RegExp#exec](mdn:js/RegExp/exec) methods.
|
||||
2. If we put a quantifier after the parentheses, it applies to the parentheses as a whole, not the last character.
|
||||
|
||||
[cut]
|
||||
|
||||
## Example
|
||||
|
||||
In the example below the pattern `pattern:(go)+` finds one or more `match:'go'`:
|
||||
|
||||
```js run
|
||||
alert( 'Gogogo now!'.match(/(go)+/i) ); // "Gogogo"
|
||||
```
|
||||
|
||||
Without parentheses, the pattern `pattern:/go+/` means `subject:g`, followed by `subject:o` repeated one or more times. For instance, `match:goooo` or `match:gooooooooo`.
|
||||
|
||||
Parentheses group the word `pattern:(go)` together.
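
The difference between the two patterns can be checked directly. This is a small sketch on the same string as above; the variable names are made up for the illustration, and `console.log` is used in place of `alert`.

```js
let str = 'Gogogo now!';

// Without parentheses the quantifier applies only to the letter "o"
let withoutGroup = str.match(/go+/i)[0];

// With parentheses the quantifier applies to the whole "go"
let withGroup = str.match(/(go)+/i)[0];

console.log(withoutGroup); // Go
console.log(withGroup);    // Gogogo
```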
|
||||
|
||||
Let's make something more complex -- a regexp to match an email.
|
||||
|
||||
Examples of emails:
|
||||
|
||||
```
|
||||
my@mail.com
|
||||
john.smith@site.com.uk
|
||||
```
|
||||
|
||||
The pattern: `pattern:[-.\w]+@([\w-]+\.)+[\w-]{2,20}`.
|
||||
|
||||
- The first part before `@` may include word characters, dots and dashes: `pattern:[-.\w]+`, like `match:john.smith`.
|
||||
- Then `pattern:@`
|
||||
- And then the domain. It may be a second-level domain like `site.com` or have subdomains like `host.site.com.uk`. We can match it as "a word followed by a dot" repeated one or more times: `match:mail.` or `match:site.com.`, and then "a word" for the last part: `match:com` or `match:uk`.
|
||||
|
||||
The word followed by a dot is `pattern:(\w+\.)+` (repeated). The last word should not have a dot at the end, so it's just `\w{2,20}`. The quantifier `pattern:{2,20}` limits the length, because domain zones are like `.uk` or `.com` or `.museum`, but can't be longer than 20 characters.
|
||||
|
||||
So the domain pattern is `pattern:(\w+\.)+\w{2,20}`. Now we replace `\w` with `[\w-]`, because dashes are also allowed in domains, and we get the final result.
|
||||
|
||||
That regexp is not perfect, but it usually works. It's short and good enough to catch errors or occasional mistypes.
|
||||
|
||||
For instance, here we can find all emails in the string:
|
||||
|
||||
```js run
|
||||
let reg = /[-.\w]+@([\w-]+\.)+[\w-]{2,20}/g;
|
||||
|
||||
alert("my@mail.com @ his@site.com.uk".match(reg)); // my@mail.com,his@site.com.uk
|
||||
```
|
||||
|
||||
|
||||
## Contents of parentheses
|
||||
|
||||
Parentheses are numbered from left to right. The search engine remembers the content of each and allows us to reference it in the pattern or in the replacement string.
|
||||
|
||||
For instance, we can find an HTML-tag using a (simplified) pattern `pattern:<.*?>`. Usually we'd want to do something with the result after it.
|
||||
|
||||
If we enclose the inner contents of `<...>` into parentheses, then we can access it like this:
|
||||
|
||||
```js run
|
||||
let str = '<h1>Hello, world!</h1>';
|
||||
let reg = /<(.*?)>/;
|
||||
|
||||
alert( str.match(reg) ); // Array: ["<h1>", "h1"]
|
||||
```
|
||||
|
||||
The call to [String#match](mdn:js/String/match) returns groups only if the regexp has no `pattern:/.../g` flag.
|
||||
|
||||
If we need all matches with their groups then we can use [RegExp#exec](mdn:js/RegExp/exec) method as described in <info:regexp-methods>:
|
||||
|
||||
```js run
|
||||
let str = '<h1>Hello, world!</h1>';
|
||||
|
||||
// two matches: opening <h1> and closing </h1> tags
|
||||
let reg = /<(.*?)>/g;
|
||||
|
||||
let match;
|
||||
|
||||
while (match = reg.exec(str)) {
|
||||
// first shows the match: <h1>,h1
|
||||
// then shows the match: </h1>,/h1
|
||||
alert(match);
|
||||
}
|
||||
```
|
||||
|
||||
Here we have two matches for `pattern:<(.*?)>`, each of them is an array with the full match and groups.
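
The loop above can be wrapped into a small function that returns all matches with their groups. This is only a sketch: the name `collectGroups`, the check for the `g` flag and the guard against zero-length matches are assumptions added for the illustration. `console.log` is used instead of `alert`.

```js
// Hypothetical helper: collects every match together with its groups
function collectGroups(str, reg) {
  if (!reg.global) {
    // without the g flag exec would return the first match forever
    throw new Error("collectGroups expects a regexp with the g flag");
  }

  reg.lastIndex = 0; // start from the beginning of the string
  let results = [];
  let match;

  while ((match = reg.exec(str)) !== null) {
    results.push(Array.from(match)); // [full match, group 1, group 2, ...]
    if (match[0] === "") reg.lastIndex++; // avoid looping on empty matches
  }

  return results;
}

let str = '<h1>Hello, world!</h1>';

// With the g flag, match returns only the full matches, without groups
console.log( str.match(/<(.*?)>/g).join() ); // <h1>,</h1>

// The exec loop keeps the groups for each match
let all = collectGroups(str, /<(.*?)>/g);
console.log( all[0].join() ); // <h1>,h1
console.log( all[1].join() ); // </h1>,/h1
```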
|
||||
|
||||
## Nested groups
|
||||
|
||||
Parentheses can be nested. In this case the numbering also goes from left to right.
|
||||
|
||||
For instance, when searching a tag in `subject:<span class="my">` we may be interested in:
|
||||
|
||||
1. The tag content as a whole: `match:span class="my"`.
|
||||
2. The tag name: `match:span`.
|
||||
3. The tag attributes: `match:class="my"`.
|
||||
|
||||
Let's add parentheses for them:
|
||||
|
||||
```js run
|
||||
let str = '<span class="my">';
|
||||
|
||||
let reg = /<(([a-z]+)\s*([^>]*))>/;
|
||||
|
||||
let result = str.match(reg);
|
||||
alert(result); // <span class="my">, span class="my", span, class="my"
|
||||
```
|
||||
|
||||
Here's how groups look:
|
||||
|
||||

|
||||
|
||||
The zero index of `result` always holds the full match.
|
||||
|
||||
Then groups, numbered from left to right. Whichever opens first gives the first group `result[1]`. Here it encloses the whole tag content.
|
||||
|
||||
Then `result[2]` holds the group from the second opening `pattern:(` till the corresponding `pattern:)`, that is the tag name. We don't group the spaces, but we do group the attributes, which go into `result[3]`.
|
||||
|
||||
**If a group is optional and doesn't exist in the match, the corresponding `result` index is present (and equals `undefined`).**
|
||||
|
||||
For instance, let's consider the regexp `pattern:a(z)?(c)?`. It looks for `"a"` optionally followed by `"z"` optionally followed by `"c"`.
|
||||
|
||||
If we run it on the string with a single letter `subject:a`, then the result is:
|
||||
|
||||
```js run
|
||||
let match = 'a'.match(/a(z)?(c)?/);
|
||||
|
||||
alert( match.length ); // 3
|
||||
alert( match[0] ); // a (whole match)
|
||||
alert( match[1] ); // undefined
|
||||
alert( match[2] ); // undefined
|
||||
```
|
||||
|
||||
The array has the length of `3`, but all groups are empty.
|
||||
|
||||
And here's a more complex match for the string `subject:ack`:
|
||||
|
||||
```js run
|
||||
let match = 'ack'.match(/a(z)?(c)?/)
|
||||
|
||||
alert( match.length ); // 3
|
||||
alert( match[0] ); // ac (whole match)
|
||||
alert( match[1] ); // undefined, because there's nothing for (z)?
|
||||
alert( match[2] ); // c
|
||||
```
|
||||
|
||||
The array length stays the same: `3`. But there's nothing for the group `pattern:(z)?`, so the result is `["ac", undefined, "c"]`.
|
||||
|
||||
## Non-capturing groups with ?:
|
||||
|
||||
Sometimes we need parentheses to correctly apply a quantifier, but we don't want their contents in the array.
|
||||
|
||||
A group may be excluded by adding `pattern:?:` in the beginning.
|
||||
|
||||
For instance, if we want to find `pattern:(go)+`, but don't want to remember the contents (`go`) in a separate array item, we can write: `pattern:(?:go)+`.
|
||||
|
||||
In the example below we only get the name "John" as a separate member of the `result` array:
|
||||
|
||||
```js run
|
||||
let str = "Gogo John!";
|
||||
*!*
|
||||
// exclude Gogo from capturing
|
||||
let reg = /(?:go)+ (\w+)/i;
|
||||
*/!*
|
||||
|
||||
let result = str.match(reg);
|
||||
|
||||
alert( result.length ); // 2
|
||||
alert( result[1] ); // John
|
||||
```
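
For comparison, here is a sketch that runs the capturing and the non-capturing variants on the same string. The variable names are made up for this illustration, and `console.log` replaces `alert`.

```js
let str = "Gogo John!";

let capturing = str.match(/(go)+ (\w+)/i);
let nonCapturing = str.match(/(?:go)+ (\w+)/i);

console.log(capturing.length);    // 3
console.log(capturing[1]);        // go (the last repetition of the group)
console.log(capturing[2]);        // John

console.log(nonCapturing.length); // 2
console.log(nonCapturing[1]);     // John
```

With `pattern:?:` the name moves from index `2` to index `1`, because the first pair of parentheses no longer gets a number.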
|
62
5-regular-expressions/10-regexp-backreferences/article.md
Normal file
|
@ -0,0 +1,62 @@
|
|||
# Backreferences: \n and $n
|
||||
|
||||
Capturing groups may be accessed not only in the result, but in the replacement string, and in the pattern too.
|
||||
|
||||
[cut]
|
||||
|
||||
## Group in replacement: $n
|
||||
|
||||
When we use the `replace` method, we can access the n-th group in the replacement string using `$n`.
|
||||
|
||||
For instance:
|
||||
|
||||
```js run
|
||||
let name = "John Smith";
|
||||
|
||||
name = name.replace(/(\w+) (\w+)/i, *!*"$2, $1"*/!*);
|
||||
alert( name ); // Smith, John
|
||||
```
|
||||
|
||||
Here `pattern:$1` in the replacement string means "substitute the content of the first group here", and `pattern:$2` means "substitute the second group here".
|
||||
|
||||
Referencing a group in the replacement string allows us to reuse the existing text during the replacement.
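
One more sketch of reusing groups in a replacement: reordering the parts of a date. The function name `reformatDates` and the date formats are assumptions chosen for this illustration, and `console.log` is used in place of `alert`.

```js
// Hypothetical helper: converts "YYYY-MM-DD" dates to "DD.MM.YYYY"
function reformatDates(text) {
  // $1 is the year, $2 the month, $3 the day
  return text.replace(/(\d{4})-(\d{2})-(\d{2})/g, "$3.$2.$1");
}

console.log( reformatDates("Released 2017-03-15, updated 2017-04-01") );
// Released 15.03.2017, updated 01.04.2017
```

The pattern only checks the shape of the date, not whether the month or day values make sense.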
|
||||
|
||||
## Group in pattern: \n
|
||||
|
||||
A group can be referenced in the pattern using `\n`.
|
||||
|
||||
To make things clear let's consider a task. We need to find a quoted string: either a single-quoted `subject:'...'` or a double-quoted `subject:"..."` -- both variants need to match.
|
||||
|
||||
How to look for them?
|
||||
|
||||
We can put two kinds of quotes in the pattern: `pattern:['"](.*?)['"]`. That finds strings like `match:"..."` and `match:'...'`, but it gives incorrect matches when one quote appears inside another one, like the string `subject:"She's the one!"`:
|
||||
|
||||
```js run
|
||||
let str = "He said: \"She's the one!\".";
|
||||
|
||||
let reg = /['"](.*?)['"]/g;
|
||||
|
||||
// The result is not what we expect
|
||||
alert( str.match(reg) ); // "She'
|
||||
```
|
||||
|
||||
As we can see, the pattern found an opening quote `match:"`, then the text is consumed lazily till the other quote `match:'`, that closes the match.
|
||||
|
||||
To make sure that the pattern looks for the closing quote exactly the same as the opening one, let's make a group of it and use the backreference:
|
||||
|
||||
```js run
|
||||
let str = "He said: \"She's the one!\".";
|
||||
|
||||
let reg = /(['"])(.*?)\1/g;
|
||||
|
||||
alert( str.match(reg) ); // "She's the one!"
|
||||
```
|
||||
|
||||
Now everything's correct! The regular expression engine finds the first quote `pattern:(['"])` and remembers the content of `pattern:(...)`, that's the first capturing group.
|
||||
|
||||
Further in the pattern `pattern:\1` means "find the same text as in the first group".
|
||||
|
||||
Please note:
|
||||
|
||||
- To reference a group inside a replacement string -- we use `$1`, while in the pattern -- a backslash `\1`.
|
||||
- If we use `?:` in the group, then we can't reference it. Groups that are excluded from capturing `(?:...)` are not remembered by the engine.
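
Both kinds of reference can be seen in one small sketch: `\1` in the pattern and `$2` in the replacement string. The sample text and the square brackets in the replacement are assumptions made for this illustration, and `console.log` stands in for `alert`.

```js
let str = `He said: "She's the one!" and left with a 'bye'.`;

// group 1 is the opening quote, group 2 is the quoted text,
// \1 requires the closing quote to be the same as the opening one
let reg = /(['"])(.*?)\1/g;

let found = str.match(reg);
console.log( found.join(", ") ); // "She's the one!", 'bye'

// in the replacement string the second group is available as $2
let unquoted = str.replace(reg, "[$2]");
console.log(unquoted); // He said: [She's the one!] and left with a [bye].
```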
|
|
@ -0,0 +1,33 @@
|
|||
|
||||
The first idea can be to list the languages with `|` in-between.
|
||||
|
||||
But that doesn't work right:
|
||||
|
||||
```js run
|
||||
let reg = /Java|JavaScript|PHP|C|C\+\+/g;
|
||||
|
||||
let str = "Java, JavaScript, PHP, C, C++";
|
||||
|
||||
alert( str.match(reg) ); // Java,Java,PHP,C,C
|
||||
```
|
||||
|
||||
The regular expression engine looks for alternations one-by-one. That is: first it checks if we have `match:Java`, otherwise -- looks for `match:JavaScript` and so on.
|
||||
|
||||
As a result, `match:JavaScript` can never be found, just because `match:Java` is checked first.
|
||||
|
||||
The same with `match:C` and `match:C++`.
|
||||
|
||||
There are two solutions for that problem:
|
||||
|
||||
1. Change the order to check the longer match first: `pattern:JavaScript|Java|C\+\+|C|PHP`.
|
||||
2. Merge variants with the same start: `pattern:Java(Script)?|C(\+\+)?|PHP`.
|
||||
|
||||
In action:
|
||||
|
||||
```js run
|
||||
let reg = /Java(Script)?|C(\+\+)?|PHP/g;
|
||||
|
||||
let str = "Java, JavaScript, PHP, C, C++";
|
||||
|
||||
alert( str.match(reg) ); // Java,JavaScript,PHP,C,C++
|
||||
```
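
The first solution from the list above, reordering the alternatives so that the longer ones come first, can be sketched the same way. `console.log` is used instead of `alert`; the variable names are made up for the illustration.

```js
// longer alternatives go before their shorter prefixes
let reg = /JavaScript|Java|C\+\+|C|PHP/g;

let str = "Java, JavaScript, PHP, C, C++";

let found = str.match(reg);

console.log( found.join() ); // Java,JavaScript,PHP,C,C++
```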
|
|
@ -0,0 +1,11 @@
|
|||
# Find programming languages
|
||||
|
||||
There are many programming languages, for instance Java, JavaScript, PHP, C, C++.
|
||||
|
||||
Create a regexp that finds them in the string `subject:Java JavaScript PHP C++ C`:
|
||||
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
|
||||
alert("Java JavaScript PHP C++ C".match(reg)); // Java JavaScript PHP C++ C
|
||||
```
|
|
@ -0,0 +1,23 @@
|
|||
|
||||
Opening tag is `pattern:\[(b|url|quote)\]`.
|
||||
|
||||
Then to find everything till the closing tag, let's use the pattern `pattern:[\s\S]*?` to match any character including the newline, and then a backreference to the opening tag name in the closing tag.
|
||||
|
||||
The full pattern: `pattern:\[(b|url|quote)\][\s\S]*?\[/\1\]`.
|
||||
|
||||
In action:
|
||||
|
||||
```js run
|
||||
let reg = /\[(b|url|quote)\][\s\S]*?\[\/\1\]/g;
|
||||
|
||||
let str = `
|
||||
[b]hello![/b]
|
||||
[quote]
|
||||
[url]http://google.com[/url]
|
||||
[/quote]
|
||||
`;
|
||||
|
||||
alert( str.match(reg) ); // [b]hello![/b],[quote][url]http://google.com[/url][/quote]
|
||||
```
|
||||
|
||||
Please note that we had to escape a slash for the closing tag `pattern:[/\1]`, because normally the slash closes the pattern.
|
|
@ -0,0 +1,48 @@
|
|||
# Find bbtag pairs
|
||||
|
||||
A "bb-tag" looks like `[tag]...[/tag]`, where `tag` is one of: `b`, `url` or `quote`.
|
||||
|
||||
For instance:
|
||||
```
|
||||
[b]text[/b]
|
||||
[url]http://google.com[/url]
|
||||
```
|
||||
|
||||
BB-tags can be nested. But a tag can't be nested into itself, for instance:
|
||||
|
||||
```
|
||||
Normal:
|
||||
[url] [b]http://google.com[/b] [/url]
|
||||
[quote] [b]text[/b] [/quote]
|
||||
|
||||
Impossible:
|
||||
[b][b]text[/b][/b]
|
||||
```
|
||||
|
||||
Tags can contain line breaks, that's normal:
|
||||
|
||||
```
|
||||
[quote]
|
||||
[b]text[/b]
|
||||
[/quote]
|
||||
```
|
||||
|
||||
Create a regexp to find all BB-tags with their contents.
|
||||
|
||||
For instance:
|
||||
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
|
||||
let str = "..[url]http://google.com[/url]..";
|
||||
alert( str.match(reg) ); // [url]http://google.com[/url]
|
||||
```
|
||||
|
||||
If tags are nested, then we need the outer tag (if we want we can continue the search in its content):
|
||||
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
|
||||
let str = "..[url][b]http://google.com[/b][/url]..";
|
||||
alert( str.match(reg) ); // [url][b]http://google.com[/b][/url]
|
||||
```
|
|
@ -0,0 +1,17 @@
|
|||
The solution: `pattern:/"(\\.|[^"\\])*"/g`.
|
||||
|
||||
Step by step:
|
||||
|
||||
- First we look for an opening quote `pattern:"`
|
||||
- Then if we have a backslash `pattern:\\` (we technically have to double it in the pattern, because it is a special character, so that's a single backslash in fact), then any character is fine after it (a dot).
|
||||
- Otherwise we take any character except a quote (that would mean the end of the string) and a backslash (to prevent lonely backslashes, the backslash is only used with some other symbol after it): `pattern:[^"\\]`
|
||||
- ...And so on till the closing quote.
|
||||
|
||||
In action:
|
||||
|
||||
```js run
|
||||
let reg = /"(\\.|[^"\\])*"/g;
|
||||
let str = ' .. "test me" .. "Say \\"Hello\\"!" .. "\\\\ \\"" .. ';
|
||||
|
||||
alert( str.match(reg) ); // "test me","Say \"Hello\"!","\\ \""
|
||||
```
|
|
@ -0,0 +1,32 @@
|
|||
# Find quoted strings
|
||||
|
||||
Create a regexp to find strings in double quotes `subject:"..."`.
|
||||
|
||||
The important part is that strings should support escaping, in the same way as JavaScript strings do. For instance, quotes can be inserted as `subject:\"`, a newline as `subject:\n`, and the backslash itself as `subject:\\`.
|
||||
|
||||
```js
|
||||
let str = "Just like \"here\".";
|
||||
```
|
||||
|
||||
For us it's important that an escaped quote `subject:\"` does not end a string.
|
||||
|
||||
So we should look from one quote to the other ignoring escaped quotes on the way.
|
||||
|
||||
That's the essential part of the task, otherwise it would be trivial.
|
||||
|
||||
Examples of strings to match:
|
||||
```js
|
||||
.. *!*"test me"*/!* ..
|
||||
.. *!*"Say \"Hello\"!"*/!* ... (escaped quotes inside)
|
||||
.. *!*"\\"*/!* .. (double backslash inside)
|
||||
.. *!*"\\ \""*/!* .. (double backslash and an escaped quote inside)
|
||||
```
|
||||
|
||||
In JavaScript we need to double the backslashes to pass them right into the string, like this:
|
||||
|
||||
```js run
|
||||
let str = ' .. "test me" .. "Say \\"Hello\\"!" .. "\\\\ \\"" .. ';
|
||||
|
||||
// the in-memory string
|
||||
alert(str); // .. "test me" .. "Say \"Hello\"!" .. "\\ \"" ..
|
||||
```
|
|
@ -0,0 +1,16 @@
|
|||
|
||||
The pattern start is obvious: `pattern:<style`.
|
||||
|
||||
...But then we can't simply write `pattern:<style.*?>`, because `match:<styler>` would match it.
|
||||
|
||||
We need either a space after `match:<style` and then optionally something else or the ending `match:>`.
|
||||
|
||||
In the regexp language: `pattern:<style(>|\s.*?>)`.
|
||||
|
||||
In action:
|
||||
|
||||
```js run
|
||||
let reg = /<style(>|\s.*?>)/g;
|
||||
|
||||
alert( '<style> <styler> <style test="...">'.match(reg) ); // <style>, <style test="...">
|
||||
```
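
A possible alternative, shown only as a sketch and not as the task's intended solution, is to use a word boundary `pattern:\b` after the tag name and then take everything up to the nearest `>`. `console.log` is used in place of `alert`.

```js
// \b requires that "style" is not followed by another word character,
// [^>]* then takes the optional attributes up to the closing ">"
let reg = /<style\b[^>]*>/g;

let found = '<style> <styler> <style test="...">'.match(reg);

console.log( found.join(", ") ); // <style>, <style test="...">
```

This variant has its own limits: a dash is not a word character, so a tag like `<style-x>` would also match, and a `>` inside an attribute value would end the match early.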
|
|
@ -0,0 +1,13 @@
|
|||
# Find the full tag
|
||||
|
||||
Write a regexp to find the tag `<style...>`. It should match the full tag: it may have no attributes `<style>` or have several of them `<style type="..." id="...">`.
|
||||
|
||||
...But the regexp should not match `<styler>`!
|
||||
|
||||
For instance:
|
||||
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
|
||||
alert( '<style> <styler> <style test="...">'.match(reg) ); // <style>, <style test="...">
|
||||
```
|
72
5-regular-expressions/11-regexp-alternation/article.md
Normal file
|
@ -0,0 +1,72 @@
|
|||
# Alternation (OR) |
|
||||
|
||||
Alternation is the term in regular expressions for what is actually a simple "OR".
|
||||
|
||||
In a regular expression it is denoted with a vertical line character `pattern:|`.
|
||||
|
||||
[cut]
|
||||
|
||||
For instance, we need to find programming languages: HTML, PHP, CSS, Java or JavaScript.

The corresponding regexp: `pattern:html|php|css|java(script)?`.
|
||||
|
||||
A usage example:
|
||||
|
||||
```js run
|
||||
let reg = /html|php|css|java(script)?/gi;
|
||||
|
||||
let str = "First HTML appeared, then CSS, then JavaScript";
|
||||
|
||||
alert( str.match(reg) ); // 'HTML', 'CSS', 'JavaScript'
|
||||
```
|
||||
|
||||
We already know a similar thing -- square brackets. They allow choosing between multiple characters, for instance `pattern:gr[ae]y` matches `match:gray` or `match:grey`.
|
||||
|
||||
Alternation works not on the character level, but on the expression level. A regexp `pattern:A|B|C` means one of the expressions `A`, `B` or `C`.
|
||||
|
||||
For instance:
|
||||
|
||||
- `pattern:gr(a|e)y` means exactly the same as `pattern:gr[ae]y`.
|
||||
- `pattern:gra|ey` means "gra" or "ey".
|
||||
|
||||
To separate a part of the pattern for alternation we usually enclose it in parentheses, like this: `pattern:before(XXX|YYY)after`.
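
The effect of the parentheses on alternation can be checked with a short sketch. The sample string and variable names are assumptions chosen for this illustration, and `console.log` replaces `alert`.

```js
let str = "gray grey";

// Alternation limited by parentheses: "gr", then "a" or "e", then "y"
let grouped = str.match(/gr(a|e)y/g);

// Alternation over the whole pattern: either "gra" or "ey"
let ungrouped = str.match(/gra|ey/g);

console.log( grouped.join() );   // gray,grey
console.log( ungrouped.join() ); // gra,ey
```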
|
||||
|
||||
## Regexp for time
|
||||
|
||||
In previous chapters there was a task to build a regexp for searching time in the form `hh:mm`, for instance `12:00`. But a simple `pattern:\d\d:\d\d` is too vague. It accepts `25:99` as the time.
|
||||
|
||||
How can we make a better one?
|
||||
|
||||
We can apply more careful matching:
|
||||
|
||||
- The first digit must be `0` or `1` followed by any digit.
|
||||
- Or `2` followed by `pattern:[0-3]`
|
||||
|
||||
As a regexp: `pattern:[01]\d|2[0-3]`.
|
||||
|
||||
Then we can add a colon and the minutes part.
|
||||
|
||||
The minutes must be from `00` to `59`. In the regexp language that means the first digit `pattern:[0-5]` followed by any other digit `pattern:\d`.
|
||||
|
||||
Let's glue them together into the pattern: `pattern:[01]\d|2[0-3]:[0-5]\d`.
|
||||
|
||||
We're almost done, but there's a problem. The alternation `|` is between the `pattern:[01]\d` and `pattern:2[0-3]:[0-5]\d`. That's wrong, because it will match either the left or the right pattern:
|
||||
|
||||
|
||||
```js run
|
||||
let reg = /[01]\d|2[0-3]:[0-5]\d/g;
|
||||
|
||||
alert("12".match(reg)); // 12 (matched [01]\d)
|
||||
```
|
||||
|
||||
That's rather obvious, but still a common mistake when starting to work with regular expressions.
|
||||
|
||||
We need to add parentheses to apply alternation exactly to hours: `[01]\d` OR `2[0-3]`.
|
||||
|
||||
The correct variant:
|
||||
|
||||
```js run
|
||||
let reg = /([01]\d|2[0-3]):[0-5]\d/g;
|
||||
|
||||
alert("00:00 10:10 23:59 25:99 1:2".match(reg)); // 00:00,10:10,23:59
|
||||
```
|
|
@ -0,0 +1,6 @@
|
|||
|
||||
The empty string is the only match: it starts and immediately finishes.
|
||||
|
||||
The task once again demonstrates that anchors are not characters, but tests.
|
||||
|
||||
The string is empty `""`. The engine first matches the `pattern:^` (input start), yes it's there, and then immediately the end `pattern:$`, it's here too. So there's a match.
|
|
@ -0,0 +1,3 @@
|
|||
# Regexp ^$
|
||||
|
||||
Which string matches the pattern `pattern:^$`?
|
|
@ -0,0 +1,21 @@
|
|||
A two-digit hex number is `pattern:[0-9a-f]{2}` (assuming the `pattern:i` flag is enabled).
|
||||
|
||||
We need that number `NN`, and then `:NN` repeated 5 times (for the remaining numbers).
|
||||
|
||||
The regexp is: `pattern:[0-9a-f]{2}(:[0-9a-f]{2}){5}`
|
||||
|
||||
Now let's show that the match should capture all the text: start at the beginning and end at the end. That's done by wrapping the pattern in `pattern:^...$`.
|
||||
|
||||
Finally:
|
||||
|
||||
```js run
|
||||
let reg = /^[0-9a-f]{2}(:[0-9a-f]{2}){5}$/i;
|
||||
|
||||
alert( reg.test('01:32:54:67:89:AB') ); // true
|
||||
|
||||
alert( reg.test('0132546789AB') ); // false (no colons)
|
||||
|
||||
alert( reg.test('01:32:54:67:89') ); // false (5 numbers, need 6)
|
||||
|
||||
alert( reg.test('01:32:54:67:89:ZZ') ); // false (ZZ at the end)
|
||||
```
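The "`NN`, then `:NN` five times" structure can also be spelled out by building the pattern from a string. This is just a sketch of the same regexp; `hexPair` and `macReg` are names invented here, not part of the task.

```js
// One two-digit hex number, the "NN" from the explanation
const hexPair = "[0-9a-f]{2}";

// NN, then ":NN" repeated 5 times, anchored at both ends, case-insensitive
const macReg = new RegExp(`^${hexPair}(:${hexPair}){5}$`, "i");

console.log(macReg.source);                    // ^[0-9a-f]{2}(:[0-9a-f]{2}){5}$
console.log(macReg.test("01:32:54:67:89:AB")); // true
console.log(macReg.test("01:32:54:67:89"));    // false (5 numbers, need 6)
```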
|
20
5-regular-expressions/12-regexp-anchors/2-test-mac/task.md
Normal file
|
@ -0,0 +1,20 @@
|
|||
# Check MAC-address
|
||||
|
||||
A [MAC-address](https://en.wikipedia.org/wiki/MAC_address) of a network interface consists of 6 two-digit hex numbers separated by colons.
|
||||
|
||||
For instance: `subject:'01:32:54:67:89:AB'`.
|
||||
|
||||
Write a regexp that checks whether a string is a MAC-address.
|
||||
|
||||
Usage:
|
||||
```js
|
||||
let reg = /your regexp/;
|
||||
|
||||
alert( reg.test('01:32:54:67:89:AB') ); // true
|
||||
|
||||
alert( reg.test('0132546789AB') ); // false (no colons)
|
||||
|
||||
alert( reg.test('01:32:54:67:89') ); // false (5 numbers, must be 6)
|
||||
|
||||
alert( reg.test('01:32:54:67:89:ZZ') ); // false (ZZ at the end)
|
||||
```
|
57
5-regular-expressions/12-regexp-anchors/article.md
Normal file
|
@ -0,0 +1,57 @@
|
|||
# String start ^ and finish $
|
||||
|
||||
The caret `pattern:'^'` and dollar `pattern:'$'` characters have special meaning in a regexp. They are called "anchors".
|
||||
|
||||
[cut]
|
||||
|
||||
The caret `pattern:^` matches at the beginning of the text, and the dollar `pattern:$` -- at the end.
|
||||
|
||||
For instance, let's test if the text starts with `Mary`:
|
||||
|
||||
```js run
|
||||
let str1 = "Mary had a little lamb, its fleece was white as snow";
|
||||
let str2 = 'Everywhere Mary went, the lamb was sure to go';
|
||||
|
||||
alert( /^Mary/.test(str1) ); // true
|
||||
alert( /^Mary/.test(str2) ); // false
|
||||
```
|
||||
|
||||
The pattern `pattern:^Mary` means: "the string start and then Mary".
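For a fixed string like `Mary`, the built-in `str.startsWith` gives the same answer; the anchor becomes more useful when the start is described by a pattern rather than literal text. A small sketch (variable names are illustrative, `console.log` replaces `alert`):

```js
const line = "Mary had a little lamb";

const byRegexp = /^Mary/.test(line);
const byMethod = line.startsWith("Mary");

// An anchored pattern can describe "starts with a digit", which startsWith can't
const startsWithDigit = /^\d/.test("1st place: Winnie");
const digitLater = /^\d/.test("Winner: 1st");

console.log(byRegexp, byMethod);          // true true
console.log(startsWithDigit, digitLater); // true false
```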
|
||||
|
||||
Now let's test whether the text ends with an email.
|
||||
|
||||
To match an email, we can use a regexp `pattern:[-.\w]+@([\w-]+\.)+[\w-]{2,20}`. It's not perfect, but mostly works.
|
||||
|
||||
To test whether the string ends with the email, let's add `pattern:$` to the pattern:
|
||||
|
||||
```js run
|
||||
let reg = /[-.\w]+@([\w-]+\.)+[\w-]{2,20}$/; // no "g" flag: test() on a global regexp keeps state in lastIndex
|
||||
|
||||
let str1 = 'My email is mail@site.com';
|
||||
let str2 = 'Everywhere Mary went, the lamb was sure to go';
|
||||
|
||||
alert( reg.test(str1) ); // true
|
||||
alert( reg.test(str2) ); // false
|
||||
```
|
||||
|
||||
We can use both anchors together to check whether the string exactly follows the pattern. That's often used for validation.
|
||||
|
||||
For instance, we want to check that `str` is exactly a color in the form `#` plus 6 hex digits. The pattern for the color is `pattern:#[0-9a-f]{6}`.
|
||||
|
||||
To check that the *whole string* exactly matches it, we add `pattern:^...$`:
|
||||
|
||||
```js run
|
||||
let str = "#abcdef";
|
||||
|
||||
alert( /^#[0-9a-f]{6}$/i.test(str) ); // true
|
||||
```
|
||||
|
||||
The regexp engine looks for the text start, then the color, and then immediately the text end. Just what we need.
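To see why both anchors matter for validation, compare the anchored check with an unanchored one. The helper `isHexColor` below is a hypothetical name used only for this sketch.

```js
// Validation: the whole string must be "#" plus 6 hex digits
function isHexColor(str) {
  return /^#[0-9a-f]{6}$/i.test(str);
}

// Search: the color may appear anywhere inside the string
const unanchored = /#[0-9a-f]{6}/i;

console.log(isHexColor("#ABCDEF"));              // true
console.log(isHexColor("color: #abcdef;"));      // false, extra text around the color
console.log(unanchored.test("color: #abcdef;")); // true, the color is found inside
console.log(isHexColor("#abcdef0"));             // false, 7 digits
console.log(unanchored.test("#abcdef0"));        // true, the first 6 digits match
```

Without the anchors the regexp answers "does the string contain a color?", which is a different question from "is the string a color?".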
|
||||
|
||||
```smart header="Anchors have zero length"
|
||||
Anchors, just like `\b`, are tests. They have zero width.
|
||||
|
||||
In other words, they do not match a character, but rather force the regexp engine to check the condition (text start/end).
|
||||
```
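The zero width is easy to observe: a match of an anchor alone is an empty string, and replacing it inserts text without removing anything. A minimal sketch with illustrative names:

```js
const text = "abc";

const startMatch = text.match(/^/);

const withPrefix = text.replace(/^/, "[");
const withSuffix = text.replace(/$/, "]");

console.log(startMatch[0] === "", startMatch.index); // true 0
console.log(withPrefix); // [abc
console.log(withSuffix); // abc]
```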
|
||||
|
||||
The behavior of anchors changes if there's a flag `pattern:m` (multiline mode). We'll explore it in the next chapter.
|
78
5-regular-expressions/13-regexp-multiline-mode/article.md
Normal file
|
@ -0,0 +1,78 @@
|
|||
# Multiline mode, flag "m"
|
||||
|
||||
The multiline mode is enabled by the flag `pattern:/.../m`.
|
||||
|
||||
[cut]
|
||||
|
||||
It only affects the behavior of `pattern:^` and `pattern:$`.
|
||||
|
||||
In the multiline mode they match not only at the beginning and end of the string, but also at start/end of line.
|
||||
|
||||
## Line start ^
|
||||
|
||||
In the example below the text has multiple lines. The pattern `pattern:/^\d+/gm` takes a number from the beginning of each one:
|
||||
|
||||
```js run
|
||||
let str = `1st place: Winnie
|
||||
2nd place: Piglet
|
||||
33rd place: Eeyore`;
|
||||
|
||||
*!*
|
||||
alert( str.match(/^\d+/gm) ); // 1, 2, 33
|
||||
*/!*
|
||||
```
|
||||
|
||||
Without the flag `pattern:/.../m` only the first number is matched:
|
||||
|
||||
|
||||
```js run
|
||||
let str = `1st place: Winnie
|
||||
2nd place: Piglet
|
||||
33rd place: Eeyore`;
|
||||
|
||||
*!*
|
||||
alert( str.match(/^\d+/g) ); // 1
|
||||
*/!*
|
||||
```
|
||||
|
||||
That's because by default a caret `pattern:^` only matches at the beginning of the text, and in the multiline mode -- at the start of every line.
|
||||
|
||||
The regular expression engine moves along the text and looks for a line start `pattern:^`. When it finds one, it continues to match the rest of the pattern `pattern:\d+`.
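One way to see exactly where `pattern:^` matches is to replace it with a visible marker. The sketch below prefixes lines with `> `, once with the `m` flag and once without (variable names are made up for the example):

```js
const poem = "Winnie\nPiglet\nEeyore";

// Multiline mode: ^ matches at the start of every line
const allLines = poem.replace(/^/gm, "> ");

// Default mode: ^ matches only at the very beginning of the text
const firstOnly = poem.replace(/^/g, "> ");

console.log(allLines);
console.log(firstOnly);
```

In the first result every line gets the marker, in the second one only `Winnie` does.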
|
||||
|
||||
## Line end $
|
||||
|
||||
The dollar sign `pattern:$` behaves similarly.
|
||||
|
||||
In multiline mode, the regular expression `pattern:\w+$` finds the last word in every line:
|
||||
|
||||
```js run
|
||||
let str = `1st place: Winnie
|
||||
2nd place: Piglet
|
||||
33rd place: Eeyore`;
|
||||
|
||||
alert( str.match(/\w+$/gim) ); // Winnie,Piglet,Eeyore
|
||||
```
|
||||
|
||||
Without the `pattern:/.../m` flag the dollar `pattern:$` would only match the end of the whole string, so only the very last word would be found.
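The same search with and without the flag, as a short sketch (`console.log` is used instead of `alert` so it runs outside a browser):

```js
const str = `1st place: Winnie
2nd place: Piglet
33rd place: Eeyore`;

const withoutM = str.match(/\w+$/g);
const withM = str.match(/\w+$/gm);

console.log(withoutM); // [ 'Eeyore' ]
console.log(withM);    // [ 'Winnie', 'Piglet', 'Eeyore' ]
```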
|
||||
|
||||
## Anchors ^$ versus \n
|
||||
|
||||
To find a newline, we can use not only `pattern:^` and `pattern:$`, but also the newline character `\n`.
|
||||
|
||||
The first difference is that unlike anchors, the character `\n` "consumes" the newline character and adds it to the result.
|
||||
|
||||
For instance, here we use it instead of `pattern:$`:
|
||||
|
||||
```js run
|
||||
let str = `1st place: Winnie
|
||||
2nd place: Piglet
|
||||
33rd place: Eeyore`;
|
||||
|
||||
alert( str.match(/\w+\n/gim) ); // Winnie\n,Piglet\n
|
||||
```
|
||||
|
||||
Here every match is a word plus a newline character.
|
||||
|
||||
And one more difference -- the newline `\n` does not match at the string end. That's why `Eeyore` is not found in the example above.
|
||||
|
||||
So anchors are usually better: they are closer to what we want to get.
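Both differences show up when we compare the lengths of the matches. This is only an illustrative sketch; the variable names are not from the article.

```js
const str = "Winnie\nPiglet\nEeyore";

const withNewline = str.match(/\w+\n/g); // the newline is part of each match
const withAnchor = str.match(/\w+$/gm);  // the anchor consumes nothing

const newlineLengths = withNewline.map(s => s.length);
const anchorLengths = withAnchor.map(s => s.length);

console.log(newlineLengths); // [ 7, 7 ]     two matches, each one character longer
console.log(anchorLengths);  // [ 6, 6, 6 ]  three matches, words only
```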
|
3
5-regular-expressions/14-regexp-lookahead/article.md
Normal file
|
@ -0,0 +1,3 @@
|
|||
# Lookahead (in progress)
|
||||
|
||||
The article is under development, will be here when it's ready.
|
|
@ -0,0 +1,272 @@
|
|||
# Infinite backtracking problem
|
||||
|
||||
Some regular expressions look simple, but can take a veeeeeery long time to execute, and even "hang" the JavaScript engine.
|
||||
|
||||
Sooner or later most developers run into this behavior.
|
||||
|
||||
The typical situation -- a regular expression works fine for some time, and then starts to "hang" the script and make it consume 100% of CPU.
|
||||
|
||||
That may even be a vulnerability, for instance if JavaScript runs on the server and applies regular expressions to user data. There have been many vulnerabilities of that kind, even in widely distributed systems.
|
||||
|
||||
So the problem is definitely worth dealing with.
|
||||
|
||||
[cut]
|
||||
|
||||
## Example
|
||||
|
||||
The plan will be like this:
|
||||
|
||||
1. First we see how the problem may occur.
|
||||
2. Then we simplify the situation and see why it occurs.
|
||||
3. Then we fix it.
|
||||
|
||||
For instance let's consider searching tags in HTML.
|
||||
|
||||
We want to find all tags, with or without attributes -- like `subject:<a href="..." class="doc" ...>`. We need the regexp to work reliably, because HTML comes from the internet and can be messy.
|
||||
|
||||
In particular, we need it to match tags like `<a test="<>" href="#">` -- with `<` and `>` in attributes. That's allowed by the [HTML standard](https://html.spec.whatwg.org/multipage/syntax.html#syntax-attributes).
|
||||
|
||||
Now we can see that a simple regexp like `pattern:<[^>]+>` doesn't work, because it stops at the first `>`, and we need to ignore `<>` inside an attribute.
|
||||
|
||||
```js run
|
||||
// the match doesn't reach the end of the tag - wrong!
|
||||
alert( '<a test="<>" href="#">'.match(/<[^>]+>/) ); // <a test="<>
|
||||
```
|
||||
|
||||
We need the whole tag.
|
||||
|
||||
To correctly handle such situations we need a more complex regular expression. It will have the form `pattern:<tag (key=value)*>`.
|
||||
|
||||
In the regexp language that is: `pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`:
|
||||
|
||||
1. `pattern:<\w+` -- is the tag start,
|
||||
2. `pattern:(\s*\w+=(\w+|"[^"]*")\s*)*` -- is an arbitrary number of pairs `word=value`, where the value can be either a word `pattern:\w+` or a quoted string `pattern:"[^"]*"`.
|
||||
|
||||
That doesn't yet support all the details of HTML grammar, for instance strings in 'single' quotes, but these can be added later, so it's reasonably close to real life. For now we want the regexp to be simple.
|
||||
|
||||
Let's try it in action:
|
||||
|
||||
```js run
|
||||
let reg = /<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>/g;
|
||||
|
||||
let str='...<a test="<>" href="#">... <b>...';
|
||||
|
||||
alert( str.match(reg) ); // <a test="<>" href="#">, <b>
|
||||
```
|
||||
|
||||
Great, it works! It found both the long tag `match:<a test="<>" href="#">` and the short one `match:<b>`.
|
||||
|
||||
Now let's see the problem.
|
||||
|
||||
If you run the example below, it may hang the browser (or another JavaScript engine):
|
||||
|
||||
```js run
|
||||
let reg = /<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>/g;
|
||||
|
||||
let str = `<tag a=b a=b a=b a=b a=b a=b a=b a=b
|
||||
a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b`;
|
||||
|
||||
*!*
|
||||
// The search will take a long long time
|
||||
alert( str.match(reg) );
|
||||
*/!*
|
||||
```
|
||||
|
||||
Some regexp engines can handle that search, but most of them don't.
|
||||
|
||||
What's the matter? Why does a simple regular expression "hang" on such a small string?
|
||||
|
||||
Let's simplify the situation by removing the tag and quoted strings, we'll look only for attributes:
|
||||
|
||||
```js run
|
||||
// only search for space-delimited attributes
|
||||
let reg = /<(\s*\w+=\w+\s*)*>/g;
|
||||
|
||||
let str = `<a=b a=b a=b a=b a=b a=b a=b a=b
|
||||
a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b`;
|
||||
|
||||
*!*
|
||||
// the search will take a long, long time
|
||||
alert( str.match(reg) );
|
||||
*/!*
|
||||
```
|
||||
|
||||
The same.
|
||||
|
||||
Here we end the demo of the problem and start looking into what's going on.
|
||||
|
||||
## Backtracking
|
||||
|
||||
To make the example even simpler, let's consider `pattern:(\d+)*$`.
|
||||
|
||||
In most regexp engines that search takes a very long time (careful -- can hang):
|
||||
|
||||
```js run
|
||||
alert( '12345678901234567890123456789123456789z'.match(/(\d+)*$/) );
|
||||
```
|
||||
|
||||
So what's wrong with the regexp?
|
||||
|
||||
Actually, it looks a little bit strange. The quantifier `pattern:*` looks extraneous. If we want a number, we can use `pattern:\d+$`.
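For comparison, here is the simpler pattern on the same string, as a sketch with illustrative variable names. It has no nested quantifier, so it only needs a modest amount of backtracking:

```js
const digits = "12345678901234567890123456789123456789z";

// No star around the group: \d+ can only shrink, it cannot be re-split
const noMatch = digits.match(/\d+$/);

// Without the trailing "z" the whole number is found
const number = "123456789".match(/\d+$/);

console.log(noMatch);   // null
console.log(number[0]); // 123456789
```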
|
||||
|
||||
Yes, the regexp is artificial, but the reason why it is slow is the same as those we saw above. So let's understand it.
|
||||
|
||||
What happens during the search of `pattern:(\d+)*$` in the string `subject:123456789z`?
|
||||
|
||||
1. First, the regexp engine tries to find a number `pattern:\d+`. The plus `pattern:+` is greedy by default, so it consumes all digits:
|
||||
|
||||
```
|
||||
\d+.......
|
||||
(123456789)z
|
||||
```
|
||||
2. Then it tries to apply the star around the parentheses `pattern:(\d+)*`, but there are no more digits, so the star doesn't give anything.
|
||||
|
||||
Then the pattern has the string end anchor `pattern:$`, and in the text we have `subject:z`.
|
||||
|
||||
```
|
||||
X
|
||||
\d+........$
|
||||
(123456789)z
|
||||
```
|
||||
|
||||
No match!
|
||||
3. There's no match, so the greedy quantifier `pattern:+` decreases the count of repetitions (backtracks).
|
||||
|
||||
Now `\d+` is not all digits, but all except the last one:
|
||||
```
|
||||
\d+.......
|
||||
(12345678)9z
|
||||
```
|
||||
4. Now the engine tries to continue the search from the new position (`9`).
|
||||
|
||||
The star `pattern:(\d+)*` can now be applied -- it gives the number `match:9`:
|
||||
|
||||
```
|
||||
|
||||
\d+.......\d+
|
||||
(12345678)(9)z
|
||||
```
|
||||
|
||||
The engine tries to match `pattern:$` again, but fails, because it meets `subject:z`:
|
||||
|
||||
```
|
||||
X
|
||||
\d+.......\d+
|
||||
(12345678)(9)z
|
||||
```
|
||||
|
||||
There's no match, so the engine will continue backtracking.
|
||||
5. Now the first number `pattern:\d+` will have 7 digits, and the rest of the string `subject:89` becomes the second `pattern:\d+`:
|
||||
|
||||
```
|
||||
X
|
||||
\d+......\d+
|
||||
(1234567)(89)z
|
||||
```
|
||||
|
||||
...Still no match for `pattern:$`.
|
||||
|
||||
The search engine backtracks again. Backtracking generally works like this: the last greedy quantifier decreases the number of repetitions as long as it can. Then the previous greedy quantifier decreases, and so on. In our case the last greedy quantifier is the second `pattern:\d+`, which shrinks from `subject:89` to `subject:8`, and then the star takes `subject:9`:
|
||||
|
||||
```
|
||||
X
|
||||
\d+......\d+\d+
|
||||
(1234567)(8)(9)z
|
||||
```
|
||||
6. ...Fail again. The second and third `pattern:\d+` backtracked to the end, so the first quantifier shortens the match to `subject:123456`, and the star takes the rest:
|
||||
|
||||
```
|
||||
X
|
||||
\d+.......\d+
|
||||
(123456)(789)z
|
||||
```
|
||||
|
||||
Again no match. The process repeats: the last greedy quantifier releases one character (`9`):
|
||||
|
||||
```
|
||||
X
|
||||
\d+.....\d+ \d+
|
||||
(123456)(78)(9)z
|
||||
```
|
||||
7. ...And so on.
|
||||
|
||||
The regular expression engine goes through all the ways to split `123456789` into consecutive numbers. There are a lot of them (the count doubles with every extra digit), that's why it takes so long.
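To get a feeling for the numbers, we can list every way to cut a string of digits into consecutive non-empty pieces. The function `splits` below is a hypothetical helper written for this illustration; the regexp engine does not call anything like it, but the variants it has to try grow the same way.

```js
// Enumerate every way to cut a string into consecutive non-empty pieces
function splits(str) {
  if (str.length === 0) return [[]];
  const result = [];
  for (let i = 1; i <= str.length; i++) {
    const head = str.slice(0, i);
    for (const rest of splits(str.slice(i))) {
      result.push([head, ...rest]);
    }
  }
  return result;
}

const small = splits("123").map(parts => parts.join("|"));
const nineDigits = splits("123456789").length;

console.log(small);      // [ '1|2|3', '1|23', '12|3', '123' ]
console.log(nineDigits); // 256, that is 2 ** 8
```

For `n` digits there are `2 ** (n - 1)` such splits, so the 38 digits from the example above give more than a hundred billion of them.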
|
||||
|
||||
Someone might say here: "Backtracking? Let's turn on the lazy mode -- and no more backtracking!".
|
||||
|
||||
Let's replace `pattern:\d+` with `pattern:\d+?` and see if it works (careful, can hang the browser):
|
||||
|
||||
```js run
|
||||
// sloooooowwwwww
|
||||
alert( '12345678901234567890123456789123456789z'.match(/(\d+?)*$/) );
|
||||
```
|
||||
|
||||
No, it doesn't.
|
||||
|
||||
Lazy quantifiers actually do the same, but in the reverse order. Just think about how the search engine would work in this case.
|
||||
|
||||
Some regular expression engines have tricky built-in checks to detect infinite backtracking, or other means to work around it, but there's no universal solution.
|
||||
|
||||
In the example above, when we search `pattern:<(\s*\w+=\w+\s*)*>` in the string `subject:<a=b a=b a=b a=b`, a similar thing happens.
|
||||
|
||||
The string has no `>` at the end, so the match is impossible, but the regexp engine does not know about it. The search backtracks trying different combinations of `pattern:(\s*\w+=\w+\s*)`:
|
||||
|
||||
```
|
||||
(a=b a=b a=b) (a=b)
|
||||
(a=b a=b) (a=b a=b)
|
||||
...
|
||||
```
|
||||
|
||||
## How to fix?
|
||||
|
||||
The problem is that backtracking tries too many variants, even those we don't need.
|
||||
|
||||
For instance, in the pattern `pattern:(\d+)*$` we (people) can easily see that `pattern:(\d+)` does not need to backtrack.
|
||||
|
||||
Decreasing the count of `pattern:\d+` cannot help to find a match. For the end anchor `pattern:$` there's no difference between these two:
|
||||
|
||||
```
|
||||
\d+........
|
||||
(123456789)z
|
||||
|
||||
\d+...\d+....
|
||||
(1234)(56789)z
|
||||
```
|
||||
|
||||
Let's get back to a more real-life example: `pattern:<(\s*\w+=\w+\s*)*>`. We want it to find pairs `name=value` (as many as it can). There's no need for backtracking here.
|
||||
|
||||
In other words, if it found many `name=value` pairs and then can't find `>`, then there's no need to decrease the count of repetitions. Even if we match one pair less, it won't give us the closing `>`.
|
||||
|
||||
Some regexp engines support so-called "possessive" quantifiers for that. They are like greedy ones, but don't backtrack at all. Put simply, they capture whatever they can, and the search continues. There's also another tool called "atomic groups" that forbids backtracking inside parentheses.
|
||||
|
||||
Unfortunately, neither of these features is supported by JavaScript.
|
||||
|
||||
However, we can get a similar effect using lookahead. There's more about the relation between possessive quantifiers and lookahead in the articles [Regex: Emulate Atomic Grouping (and Possessive Quantifiers) with LookAhead](http://instanceof.me/post/52245507631/regex-emulate-atomic-grouping-with-lookahead) and [Mimicking Atomic Groups](http://blog.stevenlevithan.com/archives/mimic-atomic-groups).
|
||||
|
||||
The pattern to take as many repetitions as possible without backtracking is: `pattern:(?=(a+))\1`.
|
||||
|
||||
In other words, the lookahead `pattern:?=` looks for the maximal count `pattern:a+` from the current position. And then they are "consumed into the result" by the backreference `pattern:\1`.
|
||||
|
||||
There will be no backtracking, because lookahead does not backtrack. If it found, say, 5 repetitions of `pattern:a` and the further match failed, then it doesn't go back to 4.
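The difference is visible on a tiny example where a match is possible only if the quantifier gives one character back. The patterns and names below are made up for this sketch.

```js
// Greedy a+ first takes "aaa", then backtracks to "aa" so that "ab" can match
const greedy = /a+ab/;

// The lookahead captures "aaa", \1 consumes it, and nothing can be given back
const noBacktrack = /(?=(a+))\1ab/;

const greedyResult = greedy.test("aaab");
const noBacktrackResult = noBacktrack.test("aaab");

// When no give-back is needed, both behave the same
const plainMatch = "aaab".match(/(?=(a+))\1b/)[0];

console.log(greedyResult);      // true
console.log(noBacktrackResult); // false
console.log(plainMatch);        // aaab
```

So the lookahead variant is not a drop-in replacement for a greedy quantifier: it fits only when giving characters back could never help, as with the `name=value` pairs.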
|
||||
|
||||
Let's fix the regexp for a tag with attributes from the beginning of the chapter: `pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`. We'll use lookahead to prevent backtracking of `name=value` pairs:
|
||||
|
||||
```js run
|
||||
// regexp to search name=value
|
||||
let attrReg = /(\s*\w+=(\w+|"[^"]*")\s*)/;
|
||||
|
||||
// use it inside the regexp for tag
|
||||
let reg = new RegExp('<\\w+(?=(' + attrReg.source + '*))\\1>', 'g');
|
||||
|
||||
let good = '...<a test="<>" href="#">... <b>...';
|
||||
|
||||
let bad = `<tag a=b a=b a=b a=b a=b a=b a=b a=b
|
||||
a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b`;
|
||||
|
||||
alert( good.match(reg) ); // <a test="<>" href="#">, <b>
|
||||
alert( bad.match(reg) ); // null (no results, fast!)
|
||||
```
|
||||
|
||||
Great, it works! We found a long tag `match:<a test="<>" href="#">` and a small one `match:<b>` and didn't hang the engine.
|
||||
|
||||
Please note the `attrReg.source` property. `RegExp` objects provide access to their source string in it. That's convenient when we want to insert one regexp into another.
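A smaller sketch of the same technique (the names `numberReg` and `wholeNumber` are invented for the example). Note that `source` holds only the pattern, the flags are kept separately:

```js
const numberReg = /\d+/gi;

const src = numberReg.source;  // the pattern text, without slashes and flags
const flags = numberReg.flags; // the flags as a string

// Insert the pattern into a bigger one: the whole string must be a number
const wholeNumber = new RegExp("^" + numberReg.source + "$");

console.log(src);                      // \d+
console.log(flags);                    // gi
console.log(wholeNumber.test("123"));  // true
console.log(wholeNumber.test("123z")); // false
```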
|
3
5-regular-expressions/index.md
Normal file
|
@ -0,0 +1,3 @@
|
|||
# Regular expressions
|
||||
|
||||
Regular expressions are a powerful way of doing search and replace in strings.
|