Merge branch 'refactor'
127
9-regular-expressions/01-regexp-introduction/article.md
Normal file
|
@ -0,0 +1,127 @@
|
|||
# Patterns and flags
|
||||
|
||||
Regular expressions are a powerful way of searching and replacing inside a string.
|
||||
|
||||
In JavaScript, regular expressions are implemented as objects of the built-in `RegExp` class and are integrated with strings.
|
||||
|
||||
Please note that regular expressions vary between programming languages. In this tutorial we concentrate on JavaScript. Of course there's a lot in common, but they are somewhat different in Perl, Ruby, PHP, etc.
|
||||
|
||||
## Regular expressions
|
||||
|
||||
A regular expression (also "regexp", or just "reg") consists of a *pattern* and optional *flags*.
|
||||
|
||||
There are two syntaxes to create a regular expression object.
|
||||
|
||||
The long syntax:
|
||||
|
||||
```js
|
||||
regexp = new RegExp("pattern", "flags");
|
||||
```
|
||||
|
||||
...And the short one, using slashes `"/"`:
|
||||
|
||||
```js
|
||||
regexp = /pattern/; // no flags
|
||||
regexp = /pattern/gmi; // with flags g,m and i (to be covered soon)
|
||||
```
|
||||
|
||||
Slashes `"/"` tell JavaScript that we are creating a regular expression. They play the same role as quotes for strings.
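As a quick check that both syntaxes describe the same regular expression, the sketch below compares the standard `source` and `flags` properties of the two objects. The variable names are only for illustration:

```js
// Two ways to create the same regular expression
let viaConstructor = new RegExp("love", "gi");
let viaLiteral = /love/gi;

console.log( viaConstructor.source ); // love
console.log( viaLiteral.source );     // love
console.log( viaConstructor.flags );  // gi
console.log( viaLiteral.flags );      // gi

// They are still two separate objects
console.log( viaConstructor === viaLiteral ); // false
```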
|
||||
|
||||
## Usage
|
||||
|
||||
To search inside a string, we can use the method [search](mdn:js/String/search).
|
||||
|
||||
Here's an example:
|
||||
|
||||
```js run
|
||||
let str = "I love JavaScript!"; // will search here
|
||||
|
||||
let regexp = /love/;
|
||||
alert( str.search(regexp) ); // 2
|
||||
```
|
||||
|
||||
The `str.search` method looks for the pattern `pattern:/love/` and returns the position of the match inside the string. As we might guess, `pattern:/love/` is the simplest possible pattern: it performs a plain substring search.
|
||||
|
||||
The code above is the same as:
|
||||
|
||||
```js run
|
||||
let str = "I love JavaScript!"; // will search here
|
||||
|
||||
let substr = 'love';
|
||||
alert( str.search(substr) ); // 2
|
||||
```
|
||||
|
||||
So searching for `pattern:/love/` is the same as searching for `"love"`.
|
||||
|
||||
But that's only for now. Soon we'll create more complex regular expressions with much more searching power.
|
||||
|
||||
```smart header="Colors"
|
||||
From here on the color scheme is:
|
||||
|
||||
- regexp -- `pattern:red`
|
||||
- string (where we search) -- `subject:blue`
|
||||
- result -- `match:green`
|
||||
```
|
||||
|
||||
|
||||
````smart header="When to use `new RegExp`?"
|
||||
Normally we use the short syntax `/.../`. But it does not allow any variable insertions, so we must know the exact regexp at the time of writing the code.
|
||||
|
||||
On the other hand, `new RegExp` allows us to construct a pattern dynamically from a string.
|
||||
|
||||
So we can figure out what we need to search and create `new RegExp` from it:
|
||||
|
||||
```js run
|
||||
let search = prompt("What do you want to search for?", "love");
|
||||
let regexp = new RegExp(search);
|
||||
|
||||
// find whatever the user wants
|
||||
alert( "I love JavaScript".search(regexp));
|
||||
```
|
||||
````
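One caveat when building a pattern from arbitrary text: characters such as `+` or `.` are treated as regexp syntax, not as literal text. The sketch below shows the difference. The helper `escapeRegExp` is hypothetical (it is not a built-in), and the list of characters it escapes is an assumption about what needs escaping in a pattern:

```js
// Hypothetical helper (not a built-in): escapes characters that have
// a special meaning in a pattern, so the text is searched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

let userInput = "1+1";

// Without escaping, "+" is regexp syntax, so this pattern needs "11" in the string
let raw = new RegExp(userInput);
let rawPosition = "1+1=2".search(raw);
console.log( rawPosition ); // -1

// With escaping, the pattern means the literal text "1+1"
let literal = new RegExp(escapeRegExp(userInput));
let literalPosition = "1+1=2".search(literal);
console.log( literalPosition ); // 0
```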
|
||||
|
||||
|
||||
## Flags
|
||||
|
||||
Regular expressions may have flags that affect the search.
|
||||
|
||||
This tutorial covers 6 of them:
|
||||
|
||||
`i`
|
||||
: With this flag the search is case-insensitive: no difference between `A` and `a` (see the example below).
|
||||
|
||||
`g`
|
||||
: With this flag the search looks for all matches, without it -- only the first one (we'll see uses in the next chapter).
|
||||
|
||||
`m`
|
||||
: Multiline mode (covered in the chapter <info:regexp-multiline-mode>).
|
||||
|
||||
`s`
|
||||
: "Dotall" mode, allows `.` to match newlines (covered in the chapter <info:regexp-character-classes>).
|
||||
|
||||
`u`
|
||||
: Enables full Unicode support. The flag enables correct processing of surrogate pairs. More about that in the chapter <info:regexp-unicode>.
|
||||
|
||||
`y`
|
||||
: Sticky mode (covered in the chapter <info:regexp-sticky>).
|
||||
|
||||
We'll cover all these flags further in the tutorial.
|
||||
|
||||
For now, the simplest flag is `i`, here's an example:
|
||||
|
||||
```js run
|
||||
let str = "I love JavaScript!";
|
||||
|
||||
alert( str.search(/LOVE/i) ); // 2 (found lowercased)
|
||||
|
||||
alert( str.search(/LOVE/) ); // -1 (nothing found without 'i' flag)
|
||||
```
|
||||
|
||||
So the `i` flag already makes regular expressions more powerful than a simple substring search. But there's so much more. We'll cover other flags and features in the next chapters.
|
||||
|
||||
|
||||
## Summary
|
||||
|
||||
- A regular expression consists of a pattern and optional flags: `g`, `i`, `m`, `u`, `s`, `y`.
|
||||
- Without flags and special symbols that we'll study later, the search by a regexp is the same as a substring search.
|
||||
- The method `str.search(regexp)` returns the index where the match is found or `-1` if there's no match. In the next chapter we'll see other methods.
|
458
9-regular-expressions/02-regexp-methods/article.md
Normal file
|
@ -0,0 +1,458 @@
|
|||
# Methods of RegExp and String
|
||||
|
||||
There are two sets of methods to deal with regular expressions.
|
||||
|
||||
1. First, regular expressions are objects of the built-in [RegExp](mdn:js/RegExp) class, which provides several methods.
|
||||
2. Besides that, there are methods of regular strings that can work with regexps.
|
||||
|
||||
|
||||
## Recipes
|
||||
|
||||
Which method to use depends on what we'd like to do.
|
||||
|
||||
Methods become much easier to understand if we separate them by their use in real-life tasks:
|
||||
|
||||
**To search for all matches:**
|
||||
|
||||
Use a regexp with the `g` flag and:
|
||||
- Get a flat array of matches -- `str.match(reg)`
|
||||
- Get an iterable of matches with details -- `str.matchAll(reg)`.
|
||||
|
||||
**To search for the first match only:**
|
||||
- Get the full first match -- `str.match(reg)` (without `g` flag).
|
||||
- Get the string position of the first match -- `str.search(reg)`.
|
||||
- Check if there's a match -- `regexp.test(str)`.
|
||||
- Find the match from the given position -- `regexp.exec(str)` (set `regexp.lastIndex` to position).
|
||||
|
||||
**To replace all matches:**
|
||||
- Replace with another string or a function result -- `str.replace(reg, str|func)`
|
||||
|
||||
**To split the string by a separator:**
|
||||
- `str.split(str|reg)`
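The recipes above can be condensed into one short sketch. The sample string and variable names are only for illustration:

```js
let str = "I love JavaScript, love it!";

// All matches (needs the g flag)
let all = str.match(/love/g);                       // ["love", "love"]
let detailed = Array.from(str.matchAll(/lo(ve)/g)); // array of match arrays

// First match only
let first = str.match(/love/);     // ["love"], with index and input
let position = str.search(/love/); // 2
let found = /love/.test(str);      // true

// Replace all matches
let replaced = str.replace(/love/g, "like"); // "I like JavaScript, like it!"

// Split by a separator: a comma followed by a space character
let parts = str.split(/,\s/); // ["I love JavaScript", "love it!"]

console.log( all, detailed.length, first.index, position, found );
console.log( replaced );
console.log( parts );
```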
|
||||
|
||||
The rest of this chapter covers every method in detail. But if you're reading for the first time and want to learn more about regexps themselves, feel free to move ahead.
|
||||
|
||||
You may want to skip methods for now, move on to the next chapter, and then return here if something about a method is unclear.
|
||||
|
||||
## str.search(reg)
|
||||
|
||||
We've seen this method already. It returns the position of the first match or `-1` if none found:
|
||||
|
||||
```js run
|
||||
let str = "A drop of ink may make a million think";
|
||||
|
||||
alert( str.search( *!*/a/i*/!* ) ); // 0 (the first position)
|
||||
```
|
||||
|
||||
**The important limitation: `search` only finds the first match.**
|
||||
|
||||
We can't find next positions using `search`, there's just no syntax for that. But there are other methods that can.
|
||||
|
||||
## str.match(reg), no "g" flag
|
||||
|
||||
The behavior of `str.match` varies depending on whether `reg` has the `g` flag or not.
|
||||
|
||||
First, if there's no `g` flag, then `str.match(reg)` looks for the first match only.
|
||||
|
||||
The result is an array with that match and additional properties:
|
||||
|
||||
- `index` -- the position of the match inside the string,
|
||||
- `input` -- the subject string.
|
||||
|
||||
For instance:
|
||||
|
||||
```js run
|
||||
let str = "Fame is the thirst of youth";
|
||||
|
||||
let result = str.match( *!*/fame/i*/!* );
|
||||
|
||||
alert( result[0] ); // Fame (the match)
|
||||
alert( result.index ); // 0 (at the zero position)
|
||||
alert( result.input ); // "Fame is the thirst of youth" (the string)
|
||||
```
|
||||
|
||||
A match result may have more than one element.
|
||||
|
||||
**If a part of the pattern is delimited by parentheses `(...)`, then it becomes a separate element in the array.**
|
||||
|
||||
If parentheses have a name, designated by `(?<name>...)` at their start, then `result.groups[name]` has the content. We'll see that later in the chapter [about groups](info:regexp-groups).
|
||||
|
||||
For instance:
|
||||
|
||||
```js run
|
||||
let str = "JavaScript is a programming language";
|
||||
|
||||
let result = str.match( *!*/JAVA(SCRIPT)/i*/!* );
|
||||
|
||||
alert( result[0] ); // JavaScript (the whole match)
|
||||
alert( result[1] ); // script (the part of the match that corresponds to the parentheses)
|
||||
alert( result.index ); // 0
|
||||
alert( result.input ); // JavaScript is a programming language
|
||||
```
|
||||
|
||||
Due to the `i` flag the search is case-insensitive, so it finds `match:JavaScript`. The part of the match that corresponds to `pattern:SCRIPT` becomes a separate array item.
|
||||
|
||||
So, this method is used to find one full match with all details.
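Here is a minimal sketch of the named-parentheses form mentioned above. The group names `prefix` and `suffix` are made up for this example:

```js
let str = "JavaScript is a programming language";

// (?<prefix>...) and (?<suffix>...) are named groups; the names are illustrative
let result = str.match(/(?<prefix>java)(?<suffix>script)/i);

console.log( result[0] );            // JavaScript
console.log( result.groups.prefix ); // Java
console.log( result.groups.suffix ); // Script

// Named groups are still available by number as well
console.log( result[1] === result.groups.prefix ); // true
```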
|
||||
|
||||
|
||||
## str.match(reg) with "g" flag
|
||||
|
||||
When there's a `"g"` flag, then `str.match` returns an array of all matches. There are no additional properties in that array, and parentheses do not create any elements.
|
||||
|
||||
For instance:
|
||||
|
||||
```js run
|
||||
let str = "HO-Ho-ho!";
|
||||
|
||||
let result = str.match( *!*/ho/ig*/!* );
|
||||
|
||||
alert( result ); // HO, Ho, ho (array of 3 matches, case-insensitive)
|
||||
```
|
||||
|
||||
Parentheses do not change anything, here we go:
|
||||
|
||||
```js run
|
||||
let str = "HO-Ho-ho!";
|
||||
|
||||
let result = str.match( *!*/h(o)/ig*/!* );
|
||||
|
||||
alert( result ); // HO, Ho, ho
|
||||
```
|
||||
|
||||
**So, with `g` flag `str.match` returns a simple array of all matches, without details.**
|
||||
|
||||
If we want to get information about match positions and contents of parentheses then we should use `matchAll` method that we'll cover below.
|
||||
|
||||
````warn header="If there are no matches, `str.match` returns `null`"
|
||||
Please note, that's important. If there are no matches, the result is not an empty array, but `null`.
|
||||
|
||||
Keep that in mind to avoid pitfalls like this:
|
||||
|
||||
```js run
|
||||
let str = "Hey-hey-hey!";
|
||||
|
||||
alert( str.match(/Z/g).length ); // Error: Cannot read property 'length' of null
|
||||
```
|
||||
|
||||
Here `str.match(/Z/g)` is `null`, it has no `length` property.
|
||||
````
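One common way to guard against the `null` result is to fall back to an empty array. This is a sketch of that idiom, not the only option:

```js
let str = "Hey-hey-hey!";

// Fall back to an empty array when there are no matches
let noMatches = str.match(/Z/g) || [];
console.log( noMatches.length ); // 0

// The same guard is harmless when matches exist
let someMatches = str.match(/hey/gi) || [];
console.log( someMatches.length ); // 3
```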
|
||||
|
||||
## str.matchAll(regexp)
|
||||
|
||||
The method `str.matchAll(regexp)` is used to find all matches with all details.
|
||||
|
||||
For instance:
|
||||
|
||||
```js run
|
||||
let str = "Javascript or JavaScript? Should we uppercase 'S'?";
|
||||
|
||||
let result = str.matchAll( *!*/java(script)/ig*/!* );
|
||||
|
||||
let [match1, match2] = result;
|
||||
|
||||
alert( match1[0] ); // Javascript (the whole match)
|
||||
alert( match1[1] ); // script (the part of the match that corresponds to the parentheses)
|
||||
alert( match1.index ); // 0
|
||||
alert( match1.input ); // = str (the whole original string)
|
||||
|
||||
alert( match2[0] ); // JavaScript (the whole match)
|
||||
alert( match2[1] ); // Script (the part of the match that corresponds to the parentheses)
|
||||
alert( match2.index ); // 14
|
||||
alert( match2.input ); // = str (the whole original string)
|
||||
```
|
||||
|
||||
````warn header="`matchAll` returns an iterable, not array"
|
||||
For instance, if we try to get the first match by index, it won't work:
|
||||
|
||||
```js run
|
||||
let str = "Javascript or JavaScript??";
|
||||
|
||||
let result = str.matchAll( /javascript/ig );
|
||||
|
||||
*!*
|
||||
alert(result[0]); // undefined (?! there must be a match)
|
||||
*/!*
|
||||
```
|
||||
|
||||
The reason is that the iterator is not an array. We need to run `Array.from(result)` on it, or use `for..of` loop to get matches.
|
||||
|
||||
In practice, if we need all matches, then `for..of` works, so it's not a problem.
|
||||
|
||||
And, to get only a few matches, we can use destructuring:
|
||||
|
||||
```js run
|
||||
let str = "Javascript or JavaScript??";
|
||||
|
||||
*!*
|
||||
let [firstMatch] = str.matchAll( /javascript/ig );
|
||||
*/!*
|
||||
|
||||
alert(firstMatch); // Javascript
|
||||
```
|
||||
````
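The two approaches named above, `for..of` and `Array.from`, might look like this (a short sketch with illustrative variable names):

```js
let str = "Javascript or JavaScript??";

// 1) Loop over the iterable directly
let positions = [];
for (let match of str.matchAll(/javascript/ig)) {
  positions.push(match.index);
}
console.log( positions ); // [0, 14]

// 2) Or convert it to a real array first, then use indexes
let matches = Array.from(str.matchAll(/javascript/ig));
console.log( matches.length ); // 2
console.log( matches[1][0] );  // JavaScript
```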
|
||||
|
||||
```warn header="`matchAll` is very new, may need a polyfill"
|
||||
The method may not work in old browsers. A polyfill might be needed (this site uses core-js).
|
||||
|
||||
Or you could make a loop with `regexp.exec`, explained below.
|
||||
```
|
||||
|
||||
## str.split(regexp|substr, limit)
|
||||
|
||||
Splits the string using the regexp (or a substring) as a delimiter.
|
||||
|
||||
We already used `split` with strings, like this:
|
||||
|
||||
```js run
|
||||
alert('12-34-56'.split('-')) // array of [12, 34, 56]
|
||||
```
|
||||
|
||||
But we can split by a regular expression, the same way:
|
||||
|
||||
```js run
|
||||
alert('12-34-56'.split(/-/)) // array of [12, 34, 56]
|
||||
```
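A regexp becomes useful when the delimiter is not a fixed string. The sketch below also shows the optional `limit` argument from the heading, which caps the number of items in the result:

```js
// Split by a comma followed by any single space character (space or tab here)
let items = "12, 34,\t56".split(/,\s/);
console.log( items ); // ["12", "34", "56"]

// The second argument limits the number of items in the result
let firstTwo = "12-34-56".split(/-/, 2);
console.log( firstTwo ); // ["12", "34"]
```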
|
||||
|
||||
## str.replace(str|reg, str|func)
|
||||
|
||||
That's actually a great method, one of the most useful ones. The Swiss army knife for searching and replacing.
|
||||
|
||||
The simplest use -- searching and replacing a substring, like this:
|
||||
|
||||
```js run
|
||||
// replace a dash by a colon
|
||||
alert('12-34-56'.replace("-", ":")) // 12:34-56
|
||||
```
|
||||
|
||||
There's a pitfall though.
|
||||
|
||||
**When the first argument of `replace` is a string, it only looks for the first match.**
|
||||
|
||||
You can see that in the example above: only the first `"-"` is replaced by `":"`.
|
||||
|
||||
To find all dashes, we need to use not the string `"-"`, but a regexp `pattern:/-/g`, with an obligatory `g` flag:
|
||||
|
||||
```js run
|
||||
// replace all dashes by a colon
|
||||
alert( '12-34-56'.replace( *!*/-/g*/!*, ":" ) ) // 12:34:56
|
||||
```
|
||||
|
||||
The second argument is a replacement string. We can use special characters in it:
|
||||
|
||||
| Symbol | Inserts |
|
||||
|--------|--------|
|
||||
|`$$`|`"$"` |
|
||||
|`$&`|the whole match|
|
||||
|<code>$`</code>|a part of the string before the match|
|
||||
|`$'`|a part of the string after the match|
|
||||
|`$n`|if `n` is a 1-2 digit number, the contents of the n-th parentheses, counting from left to right; a named group is inserted with `$<name>`|
|
||||
|
||||
|
||||
For instance if we use `$&` in the replacement string, that means "put the whole match here".
|
||||
|
||||
Let's use it to prepend all entries of `"John"` with `"Mr."`:
|
||||
|
||||
```js run
|
||||
let str = "John Doe, John Smith and John Bull";
|
||||
|
||||
// for each John - replace it with Mr. and then John
|
||||
alert(str.replace(/John/g, 'Mr.$&')); // Mr.John Doe, Mr.John Smith and Mr.John Bull
|
||||
```
|
||||
|
||||
Quite often we'd like to reuse parts of the source string, recombine them in the replacement or wrap into something.
|
||||
|
||||
To do so, we should:
|
||||
1. First, mark the parts by parentheses in regexp.
|
||||
2. Use `$1`, `$2` (and so on) in the replacement string to get the content matched by parentheses.
|
||||
|
||||
For instance:
|
||||
|
||||
```js run
|
||||
let str = "John Smith";
|
||||
|
||||
// swap first and last name
|
||||
alert(str.replace(/(john) (smith)/i, '$2, $1')) // Smith, John
|
||||
```
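Named groups can be referenced in the replacement string as `$<name>`. The sketch below repeats the swap with named groups (the names `first` and `last` are made up for the example) and also shows `$$`:

```js
let str = "John Smith";

// Same swap as above, but the parentheses have names
let swapped = str.replace(/(?<first>john) (?<last>smith)/i, "$<last>, $<first>");
console.log( swapped ); // Smith, John

// $$ inserts a literal dollar sign, $& inserts the whole match
let priced = "Price: 10".replace(/\d\d/, "$$$&");
console.log( priced ); // Price: $10
```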
|
||||
|
||||
**For situations that require "smart" replacements, the second argument can be a function.**
|
||||
|
||||
It will be called for each match, and its result will be inserted as a replacement.
|
||||
|
||||
For instance:
|
||||
|
||||
```js run
|
||||
let i = 0;
|
||||
|
||||
// replace each "ho" by the result of the function
|
||||
alert("HO-Ho-ho".replace(/ho/gi, function() {
|
||||
return ++i;
|
||||
})); // 1-2-3
|
||||
```
|
||||
|
||||
In the example above the function just returns the next number every time, but usually the result is based on the match.
|
||||
|
||||
The function is called with arguments `func(str, p1, p2, ..., pn, offset, input, groups)`:
|
||||
|
||||
1. `str` -- the match,
|
||||
2. `p1, p2, ..., pn` -- contents of parentheses (if there are any),
|
||||
3. `offset` -- position of the match,
|
||||
4. `input` -- the source string,
|
||||
5. `groups` -- an object with named groups (see chapter [](info:regexp-groups)).
|
||||
|
||||
If there are no parentheses in the regexp, then there are only 3 arguments: `func(str, offset, input)`.
|
||||
|
||||
Let's use it to show full information about matches:
|
||||
|
||||
```js run
|
||||
// show and replace all matches
|
||||
function replacer(str, offset, input) {
|
||||
alert(`Found ${str} at position ${offset} in string ${input}`);
|
||||
return str.toLowerCase();
|
||||
}
|
||||
|
||||
let result = "HO-Ho-ho".replace(/ho/gi, replacer);
|
||||
alert( 'Result: ' + result ); // Result: ho-ho-ho
|
||||
|
||||
// shows each match:
|
||||
// Found HO at position 0 in string HO-Ho-ho
|
||||
// Found Ho at position 3 in string HO-Ho-ho
|
||||
// Found ho at position 6 in string HO-Ho-ho
|
||||
```
|
||||
|
||||
In the example below there are two parentheses, so `replacer` is called with 5 arguments: `str` is the full match, then parentheses, and then `offset` and `input`:
|
||||
|
||||
```js run
|
||||
function replacer(str, name, surname, offset, input) {
|
||||
// name is the first parentheses, surname is the second one
|
||||
return surname + ", " + name;
|
||||
}
|
||||
|
||||
let str = "John Smith";
|
||||
|
||||
alert(str.replace(/(John) (Smith)/, replacer)) // Smith, John
|
||||
```
|
||||
|
||||
Using a function gives us the ultimate replacement power, because it gets all the information about the match, has access to outer variables and can do everything.
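As a sketch of a replacement that is computed from the match, the function below reorders the parts of a date. The date format and the sample string are assumptions made up for this example:

```js
// Convert dates from "year-month-day" to "day.month.year"
function replacer(match, year, month, day) {
  // year, month, day are the contents of the three parentheses
  return `${day}.${month}.${year}`;
}

let str = "Released 2019-03-04, updated 2019-10-25";
let result = str.replace(/(\d\d\d\d)-(\d\d)-(\d\d)/g, replacer);

console.log( result ); // Released 04.03.2019, updated 25.10.2019
```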
|
||||
|
||||
## regexp.exec(str)
|
||||
|
||||
We've already seen these searching methods:
|
||||
|
||||
- `search` -- looks for the position of the match,
|
||||
- `match` -- if there's no `g` flag, returns the first match with parentheses and all details,
|
||||
- `match` -- if there's a `g` flag, returns all matches, without details or parentheses,
|
||||
- `matchAll` -- returns all matches with details.
|
||||
|
||||
The `regexp.exec` method is the most flexible searching method of all. Unlike previous methods, `exec` should be called on a regexp, rather than on a string.
|
||||
|
||||
It behaves differently depending on whether the regexp has the `g` flag.
|
||||
|
||||
If there's no `g`, then `regexp.exec(str)` returns the first match, exactly as `str.match(reg)`. Such behavior does not give us anything new.
|
||||
|
||||
But if there's `g`, then:
|
||||
- `regexp.exec(str)` returns the first match and *remembers* the position after it in `regexp.lastIndex` property.
|
||||
- The next call starts to search from `regexp.lastIndex` and returns the next match.
|
||||
- If there are no more matches then `regexp.exec` returns `null` and `regexp.lastIndex` is set to `0`.
|
||||
|
||||
We could use it to get all matches with their positions and parentheses groups in a loop, instead of `matchAll`:
|
||||
|
||||
```js run
|
||||
let str = 'A lot about JavaScript at https://javascript.info';
|
||||
|
||||
let regexp = /javascript/ig;
|
||||
|
||||
let result;
|
||||
|
||||
while (result = regexp.exec(str)) {
|
||||
alert( `Found ${result[0]} at ${result.index}` );
|
||||
// shows: Found JavaScript at 12, then:
|
||||
// shows: Found javascript at 34
|
||||
}
|
||||
```
|
||||
|
||||
Surely, `matchAll` does the same, at least in modern browsers. But what `matchAll` can't do is search from a given position.
|
||||
|
||||
Let's search from position `13`. What we need is to assign `regexp.lastIndex=13` and call `regexp.exec`:
|
||||
|
||||
```js run
|
||||
let str = "A lot about JavaScript at https://javascript.info";
|
||||
|
||||
let regexp = /javascript/ig;
|
||||
*!*
|
||||
regexp.lastIndex = 13;
|
||||
*/!*
|
||||
|
||||
let result;
|
||||
|
||||
while (result = regexp.exec(str)) {
|
||||
alert( `Found ${result[0]} at ${result.index}` );
|
||||
// shows: Found javascript at 34
|
||||
}
|
||||
```
|
||||
|
||||
Now, starting from the given position `13`, there's only one match.
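The same idea can be wrapped into a small function. `findFrom` is a hypothetical helper, not a built-in; it copies the regexp so that the caller's `lastIndex` is left untouched:

```js
// Hypothetical helper: returns the first match at or after `position`, or null
function findFrom(regexp, str, position) {
  // work on a copy that has the g flag, so that lastIndex is respected
  let flags = regexp.flags.includes("g") ? regexp.flags : regexp.flags + "g";
  let copy = new RegExp(regexp.source, flags);
  copy.lastIndex = position;
  return copy.exec(str);
}

let str = "A lot about JavaScript at https://javascript.info";

console.log( findFrom(/javascript/i, str, 0).index );  // 12
console.log( findFrom(/javascript/i, str, 13).index ); // 34
console.log( findFrom(/javascript/i, str, 35) );       // null
```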
|
||||
|
||||
|
||||
## regexp.test(str)
|
||||
|
||||
The method `regexp.test(str)` looks for a match and returns `true` or `false` depending on whether it finds one.
|
||||
|
||||
For instance:
|
||||
|
||||
```js run
|
||||
let str = "I love JavaScript";
|
||||
|
||||
// these two tests do the same
|
||||
alert( *!*/love/i*/!*.test(str) ); // true
|
||||
alert( str.search(*!*/love/i*/!*) != -1 ); // true
|
||||
```
|
||||
|
||||
An example with the negative answer:
|
||||
|
||||
```js run
|
||||
let str = "Bla-bla-bla";
|
||||
|
||||
alert( *!*/love/i*/!*.test(str) ); // false
|
||||
alert( str.search(*!*/love/i*/!*) != -1 ); // false
|
||||
```
|
||||
|
||||
If the regexp has `'g'` flag, then `regexp.test` advances `regexp.lastIndex` property, just like `regexp.exec`.
|
||||
|
||||
So we can use it to search from a given position:
|
||||
|
||||
```js run
|
||||
let regexp = /love/gi;
|
||||
|
||||
let str = "I love JavaScript";
|
||||
|
||||
// start the search from position 10:
|
||||
regexp.lastIndex = 10
|
||||
alert( regexp.test(str) ); // false (no match)
|
||||
```
|
||||
|
||||
|
||||
|
||||
````warn header="Same global regexp tested repeatedly may fail to match"
|
||||
If we apply the same global regexp to different inputs, it may lead to a wrong result, because a `regexp.test` call advances the `regexp.lastIndex` property, so the next search starts from a non-zero position.
|
||||
|
||||
For instance, here we call `regexp.test` twice on the same text, and the second time fails:
|
||||
|
||||
```js run
|
||||
let regexp = /javascript/g; // (regexp just created: regexp.lastIndex=0)
|
||||
|
||||
alert( regexp.test("javascript") ); // true (regexp.lastIndex=10 now)
|
||||
alert( regexp.test("javascript") ); // false
|
||||
```
|
||||
|
||||
That's exactly because `regexp.lastIndex` is non-zero on the second test.
|
||||
|
||||
To work around that, one could use non-global regexps or re-adjust `regexp.lastIndex=0` before a new search.
|
||||
````
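A minimal sketch of the second workaround, resetting `regexp.lastIndex` before every search (the input strings are made up for the example):

```js
let regexp = /javascript/g;
let inputs = ["javascript", "javascript", "I love javascript"];

let results = inputs.map(text => {
  regexp.lastIndex = 0; // always start from the beginning
  return regexp.test(text);
});

console.log( results ); // [true, true, true]
```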
|
||||
|
||||
## Summary
|
||||
|
||||
There is a variety of methods on both regexps and strings.
|
||||
|
||||
Their abilities overlap quite a bit, so we can often do the same thing with different calls. Sometimes that may cause confusion when starting to learn the language.
|
||||
|
||||
Then please refer to the recipes at the beginning of this chapter, as they provide solutions for the majority of regexp-related tasks.
|
|
@ -0,0 +1,6 @@
|
|||
|
||||
The answer: `pattern:\b\d\d:\d\d\b`.
|
||||
|
||||
```js run
|
||||
alert( "Breakfast at 09:00 in the room 123:456.".match( /\b\d\d:\d\d\b/ ) ); // 09:00
|
||||
```
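To see why the word boundaries matter, this sketch compares the pattern with and without `\b`, using the `g` flag to collect every match:

```js
let str = "Breakfast at 09:00 in the room 123:456.";

// Without \b the pattern also matches inside the longer number
let loose = str.match(/\d\d:\d\d/g);
console.log( loose ); // ["09:00", "23:45"]

// With \b only the standalone time is found
let strict = str.match(/\b\d\d:\d\d\b/g);
console.log( strict ); // ["09:00"]
```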
|
|
@ -0,0 +1,8 @@
|
|||
# Find the time
|
||||
|
||||
The time has a format: `hours:minutes`. Both hours and minutes have two digits, like `09:00`.
|
||||
|
||||
Make a regexp to find time in the string: `subject:Breakfast at 09:00 in the room 123:456.`
|
||||
|
||||
P.S. In this task there's no need to check time correctness yet, so `25:99` can also be a valid result.
|
||||
P.P.S. The regexp shouldn't match `123:456`.
|
265
9-regular-expressions/03-regexp-character-classes/article.md
Normal file
|
@ -0,0 +1,265 @@
|
|||
# Character classes
|
||||
|
||||
Consider a practical task -- we have a phone number `"+7(903)-123-45-67"`, and we need to turn it into pure numbers: `79031234567`.
|
||||
|
||||
To do so, we can find and remove anything that's not a number. Character classes can help with that.
|
||||
|
||||
A character class is a special notation that matches any symbol from a certain set.
|
||||
|
||||
To start, let's explore the "digit" class. It's written as `\d`. When we put it in a pattern, it means "any single digit".
|
||||
|
||||
For instance, let's find the first digit in the phone number:
|
||||
|
||||
```js run
|
||||
let str = "+7(903)-123-45-67";
|
||||
|
||||
let reg = /\d/;
|
||||
|
||||
alert( str.match(reg) ); // 7
|
||||
```
|
||||
|
||||
Without the flag `g`, the regular expression only looks for the first match, that is the first digit `\d`.
|
||||
|
||||
Let's add the `g` flag to find all digits:
|
||||
|
||||
```js run
|
||||
let str = "+7(903)-123-45-67";
|
||||
|
||||
let reg = /\d/g;
|
||||
|
||||
alert( str.match(reg) ); // array of matches: 7,9,0,3,1,2,3,4,5,6,7
|
||||
|
||||
alert( str.match(reg).join('') ); // 79031234567
|
||||
```
|
||||
|
||||
That was a character class for digits. There are other character classes as well.
|
||||
|
||||
Most used are:
|
||||
|
||||
`\d` ("d" is from "digit")
|
||||
: A digit: a character from `0` to `9`.
|
||||
|
||||
`\s` ("s" is from "space")
|
||||
: A space symbol: that includes spaces, tabs, newlines.
|
||||
|
||||
`\w` ("w" is from "word")
|
||||
: A "wordly" character: either a letter of the English alphabet, a digit, or an underscore. Non-English letters (like Cyrillic or Hindi) do not belong to `\w`.
|
||||
|
||||
For instance, `pattern:\d\s\w` means a "digit" followed by a "space character" followed by a "wordly character", like `"1 a"`.
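A short sketch of that pattern in action (the test strings are made up for the example):

```js
let reg = /\d\s\w/;

console.log( "1 a".match(reg)[0] ); // 1 a

// A tab is also a space character, so "7", the tab and "B" match
console.log( "Room 7\tB".match(reg)[0] === "7\tB" ); // true

// A dash is not a space character
console.log( "1-a".match(reg) ); // null
```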
|
||||
|
||||
**A regexp may contain both regular symbols and character classes.**
|
||||
|
||||
For instance, `pattern:CSS\d` matches a string `match:CSS` with a digit after it:
|
||||
|
||||
```js run
|
||||
let str = "CSS4 is cool";
|
||||
let reg = /CSS\d/
|
||||
|
||||
alert( str.match(reg) ); // CSS4
|
||||
```
|
||||
|
||||
Also we can use many character classes:
|
||||
|
||||
```js run
|
||||
alert( "I love HTML5!".match(/\s\w\w\w\w\d/) ); // ' HTML5' (with the leading space)
|
||||
```
|
||||
|
||||
The match (each character class corresponds to one result character):
|
||||
|
||||

|
||||
|
||||
## Word boundary: \b
|
||||
|
||||
A word boundary `pattern:\b` -- is a special character class.
|
||||
|
||||
It does not denote a character, but rather a boundary between characters.
|
||||
|
||||
For instance, `pattern:\bJava\b` matches `match:Java` in the string `subject:Hello, Java!`, but not in the string `subject:Hello, JavaScript!`.
|
||||
|
||||
```js run
|
||||
alert( "Hello, Java!".match(/\bJava\b/) ); // Java
|
||||
alert( "Hello, JavaScript!".match(/\bJava\b/) ); // null
|
||||
```
|
||||
|
||||
The boundary has "zero width": usually a character class corresponds to a character in the result (like a wordly character or a digit), but not in this case.
|
||||
|
||||
The boundary is a test.
|
||||
|
||||
When regular expression engine is doing the search, it's moving along the string in an attempt to find the match. At each string position it tries to find the pattern.
|
||||
|
||||
When the pattern contains `pattern:\b`, it tests that the position in string is a word boundary, that is one of three variants:
|
||||
|
||||
- Immediately before is `\w`, and immediately after -- not `\w`, or vice versa.
|
||||
- At string start, and the first string character is `\w`.
|
||||
- At string end, and the last string character is `\w`.
|
||||
|
||||
For instance, in the string `subject:Hello, Java!` the following positions match `\b`:
|
||||
|
||||

|
||||
|
||||
So it matches `pattern:\bHello\b`, because:
|
||||
|
||||
1. At the beginning of the string the first `\b` test matches.
|
||||
2. Then the word `Hello` matches.
|
||||
3. Then `\b` matches, as we're between `o` and a space.
|
||||
|
||||
Pattern `pattern:\bJava\b` also matches. But not `pattern:\bHell\b` (because there's no word boundary after `l`) and not `Java!\b` (because the exclamation sign is not a wordly character, so there's no word boundary after it).
|
||||
|
||||
|
||||
```js run
|
||||
alert( "Hello, Java!".match(/\bHello\b/) ); // Hello
|
||||
alert( "Hello, Java!".match(/\bJava\b/) ); // Java
|
||||
alert( "Hello, Java!".match(/\bHell\b/) ); // null (no match)
|
||||
alert( "Hello, Java!".match(/\bJava!\b/) ); // null (no match)
|
||||
```
|
||||
|
||||
Once again let's note that `pattern:\b` makes the search engine test for the boundary, so that `pattern:Java\b` finds `match:Java` only when followed by a word boundary, but it does not add a letter to the result.
|
||||
|
||||
Usually we use `\b` to find standalone English words. So if we want the `"Java"` language, then `pattern:\bJava\b` finds exactly the standalone word and ignores it when it's a part of `"JavaScript"`.
|
||||
|
||||
Another example: a regexp `pattern:\b\d\d\b` looks for standalone two-digit numbers. In other words, it requires that before and after `pattern:\d\d` must be a symbol different from `\w` (or beginning/end of the string).
|
||||
|
||||
```js run
|
||||
alert( "1 23 456 78".match(/\b\d\d\b/g) ); // 23,78
|
||||
```
|
||||
|
||||
```warn header="Word boundary doesn't work for non-English alphabets"
|
||||
The word boundary check `\b` tests for a boundary between `\w` and something else. But `\w` means an English letter (or a digit or an underscore), so the test won't work for other characters (like Cyrillic letters or hieroglyphs).
|
||||
```
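The sketch below demonstrates the limitation and one possible substitute. The substitute is not `\b` itself: it relies on lookbehind/lookahead and on the Unicode property `\p{L}` ("any letter") with the `u` flag, which are separate features, and it assumes an engine that supports them:

```js
let str = "Привет, Java!";

// \b relies on \w, and Cyrillic letters are not \w, so nothing is found
console.log( str.match(/\bПривет\b/) ); // null

// Sketch of a substitute: "no letter right before, no letter right after"
let reg = /(?<!\p{L})Привет(?!\p{L})/u;
console.log( str.match(reg)[0] ); // Привет
```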
|
||||
|
||||
|
||||
## Inverse classes
|
||||
|
||||
For every character class there exists an "inverse class", denoted with the same letter, but uppercased.
|
||||
|
||||
The "inverse" means that it matches all other characters, for instance:
|
||||
|
||||
`\D`
|
||||
: Non-digit: any character except `\d`, for instance a letter.
|
||||
|
||||
`\S`
|
||||
: Non-space: any character except `\s`, for instance a letter.
|
||||
|
||||
`\W`
|
||||
: Non-wordly character: anything but `\w`.
|
||||
|
||||
`\B`
|
||||
: Non-boundary: a test reverse to `\b`.
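A quick sketch of the four inverse classes on small made-up strings:

```js
let str = "a1 b2";

let nonDigits = str.match(/\D/g);
console.log( nonDigits ); // ["a", " ", "b"]

let nonSpaces = str.match(/\S/g);
console.log( nonSpaces ); // ["a", "1", "b", "2"]

let nonWordly = str.match(/\W/g);
console.log( nonWordly ); // [" "]

// \B: "Java" must NOT be followed by a word boundary
console.log( "Hello, JavaScript!".match(/Java\B/)[0] ); // Java
console.log( "Hello, Java!".match(/Java\B/) );          // null
```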
|
||||
|
||||
In the beginning of the chapter we saw how to get all digits from the phone `subject:+7(903)-123-45-67`.
|
||||
|
||||
One way was to match all digits and join them:
|
||||
|
||||
```js run
|
||||
let str = "+7(903)-123-45-67";
|
||||
|
||||
alert( str.match(/\d/g).join('') ); // 79031234567
|
||||
```
|
||||
|
||||
An alternative, shorter way is to find non-digits `\D` and remove them from the string:
|
||||
|
||||
|
||||
```js run
|
||||
let str = "+7(903)-123-45-67";
|
||||
|
||||
alert( str.replace(/\D/g, "") ); // 79031234567
|
||||
```
|
||||
|
||||
## Spaces are regular characters
|
||||
|
||||
Usually we pay little attention to spaces. For us strings `subject:1-5` and `subject:1 - 5` are nearly identical.
|
||||
|
||||
But if a regexp doesn't take spaces into account, it may fail to work.
|
||||
|
||||
Let's try to find digits separated by a dash:
|
||||
|
||||
```js run
|
||||
alert( "1 - 5".match(/\d-\d/) ); // null, no match!
|
||||
```
|
||||
|
||||
Here we fix it by adding spaces into the regexp `pattern:\d - \d`:
|
||||
|
||||
```js run
|
||||
alert( "1 - 5".match(/\d - \d/) ); // 1 - 5, now it works
|
||||
```
|
||||
|
||||
**A space is a character. Equal in importance with any other character.**
|
||||
|
||||
Of course, spaces in a regexp are needed only if we look for them. Extra spaces (just like any other extra characters) may prevent a match:
|
||||
|
||||
```js run
|
||||
alert( "1-5".match(/\d - \d/) ); // null, because the string 1-5 has no spaces
|
||||
```
|
||||
|
||||
In other words, in a regular expression all characters matter, spaces too.
|
||||
|
||||
## A dot is any character
|
||||
|
||||
The dot `"."` is a special character class that matches "any character except a newline".
|
||||
|
||||
For instance:
|
||||
|
||||
```js run
|
||||
alert( "Z".match(/./) ); // Z
|
||||
```
|
||||
|
||||
Or in the middle of a regexp:
|
||||
|
||||
```js run
|
||||
let reg = /CS.4/;
|
||||
|
||||
alert( "CSS4".match(reg) ); // CSS4
|
||||
alert( "CS-4".match(reg) ); // CS-4
|
||||
alert( "CS 4".match(reg) ); // CS 4 (space is also a character)
|
||||
```
|
||||
|
||||
Please note that the dot means "any character", but not the "absence of a character". There must be a character to match it:
|
||||
|
||||
```js run
|
||||
alert( "CS4".match(/CS.4/) ); // null, no match because there's no character for the dot
|
||||
```
|
||||
|
||||
### The dotall "s" flag
|
||||
|
||||
Usually a dot doesn't match a newline character.
|
||||
|
||||
For instance, this doesn't match:
|
||||
|
||||
```js run
|
||||
alert( "A\nB".match(/A.B/) ); // null (no match)
|
||||
|
||||
// a space character would match
|
||||
// or a letter, but not \n
|
||||
```
|
||||
|
||||
Sometimes it's inconvenient, we really want "any character", newline included.
|
||||
|
||||
That's what the `s` flag does. If a regexp has it, then the dot `"."` matches literally any character:
|
||||
|
||||
```js run
|
||||
alert( "A\nB".match(/A.B/s) ); // A\nB (match!)
|
||||
```
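To compare the options side by side, here is a small sketch. It assumes an environment that supports the `s` flag, and it also shows `pattern:[\s\S]` ("a space or a non-space"), a commonly used pattern that matches any character without the flag:

```js
let text = "A\nB";

// Without the "s" flag the dot does not match the newline
let plainDot = text.match(/A.B/);
// With the "s" flag the dot matches the newline too
let dotAll = text.match(/A.B/s);
// [\s\S] matches any character, no flag needed
let anyChar = text.match(/A[\s\S]B/);

console.log( plainDot );            // null
console.log( dotAll[0] === text );  // true
console.log( anyChar[0] === text ); // true

// The flag is also visible as a property of the regexp object
console.log( /A.B/s.dotAll ); // true
```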
|
||||
|
||||
|
||||
## Summary
|
||||
|
||||
The following character classes exist:
|
||||
|
||||
- `pattern:\d` -- digits.
|
||||
- `pattern:\D` -- non-digits.
|
||||
- `pattern:\s` -- space symbols, tabs, newlines.
|
||||
- `pattern:\S` -- all but `pattern:\s`.
|
||||
- `pattern:\w` -- English letters, digits, underscore `'_'`.
|
||||
- `pattern:\W` -- all but `pattern:\w`.
|
||||
- `pattern:.` -- any character if with the regexp `'s'` flag, otherwise any except a newline.
|
||||
|
||||
...But that's not all!
|
||||
|
||||
Modern JavaScript also allows looking for characters by their Unicode properties, for instance:
|
||||
|
||||
- A Cyrillic letter is: `pattern:\p{Script=Cyrillic}` or `pattern:\p{sc=Cyrillic}`.
|
||||
- A dash (be it a small hyphen `-` or a long dash `—`): `pattern:\p{Dash_Punctuation}` or `pattern:\p{Pd}`.
|
||||
- A currency symbol: `pattern:\p{Currency_Symbol}` or `pattern:\p{Sc}`.
|
||||
- ...And much more. Unicode has a lot of character categories that we can select from.
|
||||
|
||||
These patterns require the `'u'` regexp flag to work. More about that in the chapter [](info:regexp-unicode).
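A short sketch of Unicode properties in action, assuming an engine with Unicode property support. The sample strings are made up for illustration:

```js
let str = "Цена: 100₽, $2 or €1";

// Currency symbols (general category Sc)
let symbols = str.match(/\p{Sc}/gu);
// Runs of Cyrillic letters
let cyrillic = str.match(/\p{Script=Cyrillic}+/gu);
// Dashes: a hyphen and a long dash (written as \u2014 in the string)
let dashes = "a-b \u2014 c".match(/\p{Pd}/gu);
// Without the "u" flag, \p{Sc} is not treated as a property at all
let noFlag = "$".match(/\p{Sc}/);

console.log( symbols );       // ₽, $, €
console.log( cyrillic );      // Цена
console.log( dashes.length ); // 2
console.log( noFlag );        // null
```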
|
99
9-regular-expressions/04-regexp-escaping/article.md
Normal file
|
@ -0,0 +1,99 @@
|
|||
|
||||
# Escaping, special characters
|
||||
|
||||
As we've seen, a backslash `"\"` is used to denote character classes. So it's a special character in regexps (just like in a regular string).
|
||||
|
||||
There are other special characters as well that have a special meaning in a regexp. They are used to do more powerful searches. Here's a full list of them: `pattern:[ \ ^ $ . | ? * + ( )`.
|
||||
|
||||
Don't try to remember the list -- soon we'll deal with each of them separately and you'll know them by heart automatically.
|
||||
|
||||
## Escaping
|
||||
|
||||
Let's say we want to find a dot literally. Not "any character", but just a dot.
|
||||
|
||||
To use a special character as a regular one, prepend it with a backslash: `pattern:\.`.
|
||||
|
||||
That's also called "escaping a character".
|
||||
|
||||
For example:
|
||||
```js run
|
||||
alert( "Chapter 5.1".match(/\d\.\d/) ); // 5.1 (match!)
|
||||
alert( "Chapter 511".match(/\d\.\d/) ); // null (looking for a real dot \.)
|
||||
```
|
||||
|
||||
Parentheses are also special characters, so if we want them, we should use `pattern:\(`. The example below looks for a string `"g()"`:
|
||||
|
||||
```js run
|
||||
alert( "function g()".match(/g\(\)/) ); // "g()"
|
||||
```
|
||||
|
||||
If we're looking for a backslash `\`, it's a special character in both regular strings and regexps, so we should double it.
|
||||
|
||||
```js run
|
||||
alert( "1\\2".match(/\\/) ); // '\'
|
||||
```
|
||||
|
||||
## A slash
|
||||
|
||||
A slash symbol `'/'` is not a special character, but in JavaScript it is used to open and close the regexp: `pattern:/...pattern.../`, so we should escape it too.
|
||||
|
||||
Here's what a search for a slash `'/'` looks like:
|
||||
|
||||
```js run
|
||||
alert( "/".match(/\//) ); // '/'
|
||||
```
|
||||
|
||||
On the other hand, if we're not using `/.../`, but create a regexp using `new RegExp`, then we don't need to escape it:
|
||||
|
||||
```js run
|
||||
alert( "/".match(new RegExp("/")) ); // '/'
|
||||
```
|
||||
|
||||
## new RegExp
|
||||
|
||||
If we are creating a regular expression with `new RegExp`, then we don't have to escape `/`, but need to do some other escaping.
|
||||
|
||||
For instance, consider this:
|
||||
|
||||
```js run
|
||||
let reg = new RegExp("\d\.\d");
|
||||
|
||||
alert( "Chapter 5.1".match(reg) ); // null
|
||||
```
|
||||
|
||||
It worked with `pattern:/\d\.\d/`, but with `new RegExp("\d\.\d")` it doesn't, why?
|
||||
|
||||
The reason is that backslashes are "consumed" by a string. Remember, regular strings have their own special characters like `\n`, and a backslash is used for escaping.
|
||||
|
||||
Take a look at what `"\d\.\d"` really is:
|
||||
|
||||
```js run
|
||||
alert("\d\.\d"); // d.d
|
||||
```
|
||||
|
||||
The quotes "consume" backslashes and interpret them, for instance:
|
||||
|
||||
- `\n` -- becomes a newline character,
|
||||
- `\u1234` -- becomes the Unicode character with such code,
|
||||
- ...And when there's no special meaning: like `\d` or `\z`, then the backslash is simply removed.
|
||||
|
||||
So the call to `new RegExp` gets a string without backslashes. That's why it doesn't work!
|
||||
|
||||
To fix it, we need to double backslashes, because quotes turn `\\` into `\`:
|
||||
|
||||
```js run
|
||||
*!*
|
||||
let regStr = "\\d\\.\\d";
|
||||
*/!*
|
||||
alert(regStr); // \d\.\d (correct now)
|
||||
|
||||
let reg = new RegExp(regStr);
|
||||
|
||||
alert( "Chapter 5.1".match(reg) ); // 5.1
|
||||
```
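If doubling every backslash feels error-prone, one alternative is the built-in `String.raw` tag, which leaves backslashes in a template literal untouched. A minimal sketch:

```js
// String.raw keeps backslashes as they are, so no doubling is needed
let regStr = String.raw`\d\.\d`;

console.log( regStr );                 // \d\.\d
console.log( regStr === "\\d\\.\\d" ); // true, the same string as with doubled backslashes

let reg = new RegExp(regStr);
let found = "Chapter 5.1".match(reg)[0];

console.log( found ); // 5.1
```

Both spellings produce the same string, so the choice between them is a matter of readability.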
|
||||
|
||||
## Summary
|
||||
|
||||
- To search special characters `pattern:[ \ ^ $ . | ? * + ( )` literally, we need to prepend them with `\` ("escape them").
|
||||
- We also need to escape `/` if we're inside `pattern:/.../` (but not inside `new RegExp`).
|
||||
- When passing a string to `new RegExp`, we need to double backslashes `\\`, because strings consume one of them.
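When a pattern is built from arbitrary text (for instance, user input), the special characters in it have to be escaped programmatically. The sketch below assumes a hypothetical helper `escapeRegExp`; it is not a built-in, and its character list is slightly wider than the one above because it also covers `]`, `{` and `}`:

```js
// Hypothetical helper: puts a backslash before every regexp special character
function escapeRegExp(text) {
  // "$&" in the replacement string stands for the whole match
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

let userInput = "5.1 (draft)";
let reg = new RegExp(escapeRegExp(userInput));

console.log( escapeRegExp(userInput) );         // 5\.1 \(draft\)
console.log( reg.test("Chapter 5.1 (draft)") ); // true
console.log( reg.test("Chapter 511 (draft)") ); // false, the dot is literal now
```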
|
|
@ -0,0 +1,12 @@
|
|||
Answers: **no, yes**.
|
||||
|
||||
- In the string `subject:Java` it doesn't match anything, because `pattern:[^script]` means "any character except given ones". So the regexp looks for `"Java"` followed by one such symbol, but there's a string end, no symbols after it.
|
||||
|
||||
```js run
|
||||
alert( "Java".match(/Java[^script]/) ); // null
|
||||
```
|
||||
- Yes, because the regexp is case-sensitive, the `pattern:[^script]` part matches the character `"S"`: the uppercase `"S"` is not one of the lowercase letters listed in the set.
|
||||
|
||||
```js run
|
||||
alert( "JavaScript".match(/Java[^script]/) ); // "JavaS"
|
||||
```
|
|
@ -0,0 +1,5 @@
|
|||
# Java[^script]
|
||||
|
||||
We have a regexp `pattern:/Java[^script]/`.
|
||||
|
||||
Does it match anything in the string `subject:Java`? In the string `subject:JavaScript`?
|
|
@ -0,0 +1,8 @@
|
|||
Answer: `pattern:\d\d[-:]\d\d`.
|
||||
|
||||
```js run
|
||||
let reg = /\d\d[-:]\d\d/g;
|
||||
alert( "Breakfast at 09:00. Dinner at 21-30".match(reg) ); // 09:00, 21-30
|
||||
```
|
||||
|
||||
Please note that the dash `pattern:'-'` has a special meaning in square brackets, but only between other characters, not when it's in the beginning or at the end, so we don't need to escape it.
|
|
@ -0,0 +1,12 @@
|
|||
# Find the time as hh:mm or hh-mm
|
||||
|
||||
The time can be in the format `hours:minutes` or `hours-minutes`. Both hours and minutes have 2 digits: `09:00` or `21-30`.
|
||||
|
||||
Write a regexp to find time:
|
||||
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
alert( "Breakfast at 09:00. Dinner at 21-30".match(reg) ); // 09:00, 21-30
|
||||
```
|
||||
|
||||
P.S. In this task we assume that the time is always correct, there's no need to filter out bad strings like "45:67". Later we'll deal with that too.
|
|
@ -0,0 +1,114 @@
|
|||
# Sets and ranges [...]
|
||||
|
||||
Several characters or character classes inside square brackets `[…]` mean "search for any character among the given ones".
|
||||
|
||||
## Sets
|
||||
|
||||
For instance, `pattern:[eao]` means any of the 3 characters: `'a'`, `'e'`, or `'o'`.
|
||||
|
||||
That's called a *set*. Sets can be used in a regexp along with regular characters:
|
||||
|
||||
```js run
|
||||
// find [t or m], and then "op"
|
||||
alert( "Mop top".match(/[tm]op/gi) ); // "Mop", "top"
|
||||
```
|
||||
|
||||
Please note that although there are multiple characters in the set, they correspond to exactly one character in the match.
|
||||
|
||||
So the example below gives no matches:
|
||||
|
||||
```js run
|
||||
// find "V", then [o or i], then "la"
|
||||
alert( "Voila".match(/V[oi]la/) ); // null, no matches
|
||||
```
|
||||
|
||||
The pattern assumes:
|
||||
|
||||
- `pattern:V`,
|
||||
- then *one* of the letters `pattern:[oi]`,
|
||||
- then `pattern:la`.
|
||||
|
||||
So there would be a match for `match:Vola` or `match:Vila`.
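A quick check of that claim. The last line is an extra illustration, not part of the original example: two sets in a row are needed to match two characters.

```js
let reg = /V[oi]la/;

console.log( reg.test("Vola") );  // true
console.log( reg.test("Vila") );  // true
console.log( reg.test("Voila") ); // false, [oi] stands for exactly one character

// Two sets in a row match two characters
console.log( /V[oi][oi]la/.test("Voila") ); // true
```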
|
||||
|
||||
## Ranges
|
||||
|
||||
Square brackets may also contain *character ranges*.
|
||||
|
||||
For instance, `pattern:[a-z]` is a character in range from `a` to `z`, and `pattern:[0-5]` is a digit from `0` to `5`.
|
||||
|
||||
In the example below we're searching for `"x"` followed by two digits or letters from `A` to `F`:
|
||||
|
||||
```js run
|
||||
alert( "Exception 0xAF".match(/x[0-9A-F][0-9A-F]/g) ); // xAF
|
||||
```
|
||||
|
||||
Please note that in the word `subject:Exception` there's a substring `subject:xce`. It didn't match the pattern, because the letters are lowercase, while in the set `pattern:[0-9A-F]` they are uppercase.
|
||||
|
||||
If we want to find it too, then we can add a range `a-f`: `pattern:[0-9A-Fa-f]`. The `i` flag would allow lowercase too.
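Here is how the three variants compare on a made-up string that contains both uppercase and lowercase hex digits:

```js
let str = "Exception 0xAF and 0xaf";

let upperOnly = str.match(/x[0-9A-F][0-9A-F]/g);
let bothCases = str.match(/x[0-9A-Fa-f][0-9A-Fa-f]/g);
let withFlag  = str.match(/x[0-9A-F][0-9A-F]/gi);

console.log( upperOnly ); // xAF
console.log( bothCases ); // xce, xAF, xaf
console.log( withFlag );  // xce, xAF, xaf
```

Note that the wider patterns also pick up `match:xce` from the word `subject:Exception`.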
|
||||
|
||||
**Character classes are shorthands for certain character sets.**
|
||||
|
||||
For instance:
|
||||
|
||||
- **\d** -- is the same as `pattern:[0-9]`,
|
||||
- **\w** -- is the same as `pattern:[a-zA-Z0-9_]`,
|
||||
- **\s** -- is the same as `pattern:[\t\n\v\f\r ]` plus a few other Unicode space characters.
|
||||
|
||||
We can use character classes inside `[…]` as well.
|
||||
|
||||
For instance, we want to match all wordly characters or a dash, for words like "twenty-third". We can't do it with `pattern:\w+`, because `pattern:\w` class does not include a dash. But we can use `pattern:[\w-]`.
|
||||
|
||||
We also can use a combination of classes to cover every possible character, like `pattern:[\s\S]`. That matches spaces or non-spaces -- any character. That's wider than a dot `"."`, because the dot matches any character except a newline.
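Both ideas in a short sketch (the sample strings are made up for illustration):

```js
let phrase = "twenty-third place";

let plainWords  = phrase.match(/\w+/g);
let dashedWords = phrase.match(/[\w-]+/g);

console.log( plainWords );  // twenty, third, place
console.log( dashedWords ); // twenty-third, place

// [\s\S] counts the newline as a character, the dot does not
let anyCount = "a\nb".match(/[\s\S]/g).length;
let dotCount = "a\nb".match(/./g).length;

console.log( anyCount ); // 3
console.log( dotCount ); // 2
```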
|
||||
|
||||
## Excluding ranges
|
||||
|
||||
Besides normal ranges, there are "excluding" ranges that look like `pattern:[^…]`.
|
||||
|
||||
They are denoted by a caret character `^` at the start and match any character *except the given ones*.
|
||||
|
||||
For instance:
|
||||
|
||||
- `pattern:[^aeyo]` -- any character except `'a'`, `'e'`, `'y'` or `'o'`.
|
||||
- `pattern:[^0-9]` -- any character except a digit, the same as `\D`.
|
||||
- `pattern:[^\s]` -- any non-space character, same as `\S`.
|
||||
|
||||
The example below looks for any characters except letters, digits and spaces:
|
||||
|
||||
```js run
|
||||
alert( "alice15@gmail.com".match(/[^\d\sA-Z]/gi) ); // @ and .
|
||||
```
|
||||
|
||||
## No escaping in […]
|
||||
|
||||
Usually when we want to find exactly the dot character, we need to escape it like `pattern:\.`. And if we need a backslash, then we use `pattern:\\`.
|
||||
|
||||
In square brackets the vast majority of special characters can be used without escaping:
|
||||
|
||||
- A dot `pattern:'.'`.
|
||||
- A plus `pattern:'+'`.
|
||||
- Parentheses `pattern:'( )'`.
|
||||
- Dash `pattern:'-'` in the beginning or the end (where it does not define a range).
|
||||
- A caret `pattern:'^'` if not in the beginning (where it means exclusion).
|
||||
- And the opening square bracket `pattern:'['`.
|
||||
|
||||
In other words, all special characters are allowed except where they mean something for square brackets.
|
||||
|
||||
A dot `"."` inside square brackets means just a dot. The pattern `pattern:[.,]` would look for one of characters: either a dot or a comma.
|
||||
|
||||
In the example below the regexp `pattern:[-().^+]` looks for one of the characters `-().^+`:
|
||||
|
||||
```js run
|
||||
// No need to escape
|
||||
let reg = /[-().^+]/g;
|
||||
|
||||
alert( "1 + 2 - 3".match(reg) ); // Matches +, -
|
||||
```
|
||||
|
||||
...But if you decide to escape them "just in case", then there would be no harm:
|
||||
|
||||
```js run
|
||||
// Escaped everything
|
||||
let reg = /[\-\(\)\.\^\+]/g;
|
||||
|
||||
alert( "1 + 2 - 3".match(reg) ); // also works: +, -
|
||||
```
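The characters that do mean something inside square brackets still need care. A small sketch with made-up strings: a closing bracket and a backslash have to be escaped, and a dash between two characters defines a range unless it is escaped.

```js
// "]" and "\" must be escaped inside [...]
let brackets = "a]b\\c".match(/[\]\\]/g);
console.log( brackets ); // "]" and "\"

// An escaped dash is a plain dash
let literalDash = "a-c b".match(/[a\-c]/g);
console.log( literalDash ); // a, -, c

// An unescaped dash between characters is a range from "a" to "c"
let range = "a-c b".match(/[a-c]/g);
console.log( range ); // a, c, b
```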
|
|
@ -0,0 +1,9 @@
|
|||
|
||||
Solution:
|
||||
|
||||
```js run
|
||||
let reg = /\.{3,}/g;
|
||||
alert( "Hello!... How goes?.....".match(reg) ); // ..., .....
|
||||
```
|
||||
|
||||
Please note that the dot is a special character, so we have to escape it and insert as `\.`.
|
|
@ -0,0 +1,14 @@
|
|||
importance: 5
|
||||
|
||||
---
|
||||
|
||||
# How to find an ellipsis "..." ?
|
||||
|
||||
Create a regexp to find ellipsis: 3 (or more?) dots in a row.
|
||||
|
||||
Check it:
|
||||
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
alert( "Hello!... How goes?.....".match(reg) ); // ..., .....
|
||||
```
|
|
@ -0,0 +1,31 @@
|
|||
We need to look for `#` followed by 6 hexadecimal characters.
|
||||
|
||||
A hexadecimal character can be described as `pattern:[0-9a-fA-F]`. Or if we use the `i` flag, then just `pattern:[0-9a-f]`.
|
||||
|
||||
Then we can look for 6 of them using the quantifier `pattern:{6}`.
|
||||
|
||||
As a result, we have the regexp: `pattern:/#[a-f0-9]{6}/gi`.
|
||||
|
||||
```js run
|
||||
let reg = /#[a-f0-9]{6}/gi;
|
||||
|
||||
let str = "color:#121212; background-color:#AA00ef bad-colors:f#fddee #fd2"
|
||||
|
||||
alert( str.match(reg) ); // #121212,#AA00ef
|
||||
```
|
||||
|
||||
The problem is that it finds the color in longer sequences:
|
||||
|
||||
```js run
|
||||
alert( "#12345678".match( /#[a-f0-9]{6}/gi ) ); // #123456
|
||||
```
|
||||
|
||||
To fix that, we can add `pattern:\b` to the end:
|
||||
|
||||
```js run
|
||||
// color
|
||||
alert( "#123456".match( /#[a-f0-9]{6}\b/gi ) ); // #123456
|
||||
|
||||
// not a color
|
||||
alert( "#12345678".match( /#[a-f0-9]{6}\b/gi ) ); // null
|
||||
```
|
|
@ -0,0 +1,15 @@
|
|||
# Regexp for HTML colors
|
||||
|
||||
Create a regexp to search HTML-colors written as `#ABCDEF`: first `#` and then 6 hexadecimal characters.
|
||||
|
||||
An example of use:
|
||||
|
||||
```js
|
||||
let reg = /...your regexp.../
|
||||
|
||||
let str = "color:#121212; background-color:#AA00ef bad-colors:f#fddee #fd2 #12345678";
|
||||
|
||||
alert( str.match(reg) ) // #121212,#AA00ef
|
||||
```
|
||||
|
||||
P.S. In this task we do not need other color formats like `#123` or `rgb(1,2,3)` etc.
|
140
9-regular-expressions/07-regexp-quantifiers/article.md
Normal file
|
@ -0,0 +1,140 @@
|
|||
# Quantifiers +, *, ? and {n}
|
||||
|
||||
Let's say we have a string like `+7(903)-123-45-67` and want to find all numbers in it. But unlike before, we are interested not in single digits, but full numbers: `7, 903, 123, 45, 67`.
|
||||
|
||||
A number is a sequence of 1 or more digits `\d`. To mark how many we need, we need to append a *quantifier*.
|
||||
|
||||
## Quantity {n}
|
||||
|
||||
The simplest quantifier is a number in curly braces: `pattern:{n}`.
|
||||
|
||||
A quantifier is appended to a character (or a character class, or a `[...]` set etc) and specifies how many we need.
|
||||
|
||||
It has a few advanced forms, let's see examples:
|
||||
|
||||
The exact count: `{5}`
|
||||
: `pattern:\d{5}` denotes exactly 5 digits, the same as `pattern:\d\d\d\d\d`.
|
||||
|
||||
The example below looks for a 5-digit number:
|
||||
|
||||
```js run
|
||||
alert( "I'm 12345 years old".match(/\d{5}/) ); // "12345"
|
||||
```
|
||||
|
||||
We can add `\b` to exclude longer numbers: `pattern:\b\d{5}\b`.
|
||||
|
||||
The range: `{3,5}`, match 3-5 times
|
||||
: To find numbers from 3 to 5 digits we can put the limits into curly braces: `pattern:\d{3,5}`
|
||||
|
||||
```js run
|
||||
alert( "I'm not 12, but 1234 years old".match(/\d{3,5}/) ); // "1234"
|
||||
```
|
||||
|
||||
We can omit the upper limit.
|
||||
|
||||
Then a regexp `pattern:\d{3,}` looks for sequences of digits of length `3` or more:
|
||||
|
||||
```js run
|
||||
alert( "I'm not 12, but 345678 years old".match(/\d{3,}/) ); // "345678"
|
||||
```
|
||||
|
||||
Let's return to the string `+7(903)-123-45-67`.
|
||||
|
||||
A number is a sequence of one or more digits in a row. So the regexp is `pattern:\d{1,}`:
|
||||
|
||||
```js run
|
||||
let str = "+7(903)-123-45-67";
|
||||
|
||||
let numbers = str.match(/\d{1,}/g);
|
||||
|
||||
alert(numbers); // 7,903,123,45,67
|
||||
```
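The effect of the word boundary `pattern:\b` mentioned above for the exact count is easy to see on a string that contains a longer number. A small sketch with a made-up string:

```js
let str = "Zip 12345, order 1234567";

// Without \b the first five digits of the longer number are found as well
let loose = str.match(/\d{5}/g);
// With \b on both sides only standalone 5-digit numbers are found
let strict = str.match(/\b\d{5}\b/g);

console.log( loose );  // 12345, 12345
console.log( strict ); // 12345
```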
|
||||
|
||||
## Shorthands
|
||||
|
||||
There are shorthands for the most used quantifiers:
|
||||
|
||||
`+`
|
||||
: Means "one or more", the same as `{1,}`.
|
||||
|
||||
For instance, `pattern:\d+` looks for numbers:
|
||||
|
||||
```js run
|
||||
let str = "+7(903)-123-45-67";
|
||||
|
||||
alert( str.match(/\d+/g) ); // 7,903,123,45,67
|
||||
```
|
||||
|
||||
`?`
|
||||
: Means "zero or one", the same as `{0,1}`. In other words, it makes the symbol optional.
|
||||
|
||||
For instance, the pattern `pattern:ou?r` looks for `match:o` followed by zero or one `match:u`, and then `match:r`.
|
||||
|
||||
So, `pattern:colou?r` finds both `match:color` and `match:colour`:
|
||||
|
||||
```js run
|
||||
let str = "Should I write color or colour?";
|
||||
|
||||
alert( str.match(/colou?r/g) ); // color, colour
|
||||
```
|
||||
|
||||
`*`
|
||||
: Means "zero or more", the same as `{0,}`. That is, the character may repeat any number of times or be absent.
|
||||
|
||||
For example, `pattern:\d0*` looks for a digit followed by any number of zeroes:
|
||||
|
||||
```js run
|
||||
alert( "100 10 1".match(/\d0*/g) ); // 100, 10, 1
|
||||
```
|
||||
|
||||
Compare it with `'+'` (one or more):
|
||||
|
||||
```js run
|
||||
alert( "100 10 1".match(/\d0+/g) ); // 100, 10
|
||||
// 1 not matched, as 0+ requires at least one zero
|
||||
```
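Since each shorthand is just another spelling of a `{n,m}` form, both spellings should always give the same matches. The sketch below checks that with a hypothetical helper `sameMatches`, introduced here only for illustration:

```js
// Hypothetical helper: true if both regexps find the same matches in str
function sameMatches(str, regA, regB) {
  return String(str.match(regA)) === String(str.match(regB));
}

let str = "100 10 1";

console.log( sameMatches(str, /\d0+/g, /\d0{1,}/g) );  // true
console.log( sameMatches(str, /\d0*/g, /\d0{0,}/g) );  // true
console.log( sameMatches(str, /\d0?/g, /\d0{0,1}/g) ); // true

// "?" allows at most one zero, so "100" is split into two matches
console.log( str.match(/\d0?/g) ); // 10, 0, 10, 1
```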
|
||||
|
||||
## More examples
|
||||
|
||||
Quantifiers are used very often. They serve as the main "building block" of complex regular expressions, so let's see more examples.
|
||||
|
||||
Regexp "decimal fraction" (a number with a floating point): `pattern:\d+\.\d+`
|
||||
: In action:
|
||||
```js run
|
||||
alert( "0 1 12.345 7890".match(/\d+\.\d+/g) ); // 12.345
|
||||
```
|
||||
|
||||
Regexp "open HTML-tag without attributes", like `<span>` or `<p>`: `pattern:/<[a-z]+>/i`
|
||||
: In action:
|
||||
|
||||
```js run
|
||||
alert( "<body> ... </body>".match(/<[a-z]+>/gi) ); // <body>
|
||||
```
|
||||
|
||||
We look for character `pattern:'<'` followed by one or more English letters, and then `pattern:'>'`.
|
||||
|
||||
Regexp "open HTML-tag without attributes" (improved): `pattern:/<[a-z][a-z0-9]*>/i`
|
||||
: Better regexp: according to the standard, HTML tag name may have a digit at any position except the first one, like `<h1>`.
|
||||
|
||||
```js run
|
||||
alert( "<h1>Hi!</h1>".match(/<[a-z][a-z0-9]*>/gi) ); // <h1>
|
||||
```
|
||||
|
||||
Regexp "opening or closing HTML-tag without attributes": `pattern:/<\/?[a-z][a-z0-9]*>/i`
|
||||
: We added an optional slash `pattern:/?` before the tag. Had to escape it with a backslash, otherwise JavaScript would think it is the pattern end.
|
||||
|
||||
```js run
|
||||
alert( "<h1>Hi!</h1>".match(/<\/?[a-z][a-z0-9]*>/gi) ); // <h1>, </h1>
|
||||
```
|
||||
|
||||
```smart header="To make a regexp more precise, we often need to make it more complex"
|
||||
We can see one common rule in these examples: the more precise is the regular expression -- the longer and more complex it is.
|
||||
|
||||
For instance, for HTML tags we could use a simpler regexp: `pattern:<\w+>`.
|
||||
|
||||
...But because `pattern:\w` means any English letter or a digit or `'_'`, the regexp also matches non-tags, for instance `match:<_>`. So it's much simpler than `pattern:<[a-z][a-z0-9]*>`, but less reliable.
|
||||
|
||||
Are we ok with `pattern:<\w+>` or do we need `pattern:<[a-z][a-z0-9]*>`?
|
||||
|
||||
In real life both variants are acceptable. Depends on how tolerant we can be to "extra" matches and whether it's difficult or not to filter them out by other means.
|
||||
```
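To see the trade-off between the simple and the precise tag pattern, we can run both over a made-up string that contains a few non-tags:

```js
let str = "<h1> <_> <1x> <span>";

let simple  = str.match(/<\w+>/g);
let precise = str.match(/<[a-z][a-z0-9]*>/gi);

console.log( simple );  // <h1>, <_>, <1x>, <span>
console.log( precise ); // <h1>, <span>
```

The simpler pattern is shorter to write, but lets `match:<_>` and `match:<1x>` through.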
|
|
@ -0,0 +1,6 @@
|
|||
|
||||
The result is: `match:123 4`.
|
||||
|
||||
First the lazy `pattern:\d+?` tries to take as few digits as it can, but it has to reach the space, so it takes `match:123`.
|
||||
|
||||
Then the second `\d+?` takes only one digit, because that's enough.
|
|
@ -0,0 +1,7 @@
|
|||
# A match for /d+? d+?/
|
||||
|
||||
What's the match here?
|
||||
|
||||
```js
|
||||
alert( "123 456".match(/\d+? \d+?/g) ); // ?
|
||||
```
|
|
@ -0,0 +1,17 @@
|
|||
We need to find the beginning of the comment `match:<!--`, then everything till the end of `match:-->`.
|
||||
|
||||
The first idea could be `pattern:<!--.*?-->` -- the lazy quantifier makes the dot stop right before `match:-->`.
|
||||
|
||||
But a dot in JavaScript means "any symbol except the newline". So multiline comments won't be found.
|
||||
|
||||
We can use `pattern:[\s\S]` instead of the dot to match "anything":
|
||||
|
||||
```js run
|
||||
let reg = /<!--[\s\S]*?-->/g;
|
||||
|
||||
let str = `... <!-- My -- comment
|
||||
test --> .. <!----> ..
|
||||
`;
|
||||
|
||||
alert( str.match(reg) ); // '<!-- My -- comment \n test -->', '<!---->'
|
||||
```
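An alternative sketch, assuming the environment supports the dotall `s` flag: with it, the first idea `pattern:<!--.*?-->` works for multiline comments too, because the dot then matches newlines.

```js
let str = `... <!-- My -- comment
 test --> .. <!----> ..
`;

let withClass = str.match(/<!--[\s\S]*?-->/g);
let withFlag  = str.match(/<!--.*?-->/gs);

console.log( withFlag.length );                       // 2
console.log( String(withClass) === String(withFlag) ); // true
```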
|
|
@ -0,0 +1,13 @@
|
|||
# Find HTML comments
|
||||
|
||||
Find all HTML comments in the text:
|
||||
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
|
||||
let str = `... <!-- My -- comment
|
||||
test --> .. <!----> ..
|
||||
`;
|
||||
|
||||
alert( str.match(reg) ); // '<!-- My -- comment \n test -->', '<!---->'
|
||||
```
|
|
@ -0,0 +1,10 @@
|
|||
|
||||
The solution is `pattern:<[^<>]+>`.
|
||||
|
||||
```js run
|
||||
let reg = /<[^<>]+>/g;
|
||||
|
||||
let str = '<> <a href="/"> <input type="radio" checked> <b>';
|
||||
|
||||
alert( str.match(reg) ); // '<a href="/">', '<input type="radio" checked>', '<b>'
|
||||
```
|
|
@ -0,0 +1,15 @@
|
|||
# Find HTML tags
|
||||
|
||||
Create a regular expression to find all (opening and closing) HTML tags with their attributes.
|
||||
|
||||
An example of use:
|
||||
|
||||
```js run
|
||||
let reg = /your regexp/g;
|
||||
|
||||
let str = '<> <a href="/"> <input type="radio" checked> <b>';
|
||||
|
||||
alert( str.match(reg) ); // '<a href="/">', '<input type="radio" checked>', '<b>'
|
||||
```
|
||||
|
||||
Let's assume that tags may not contain `<` and `>` inside (in quotes too); that simplifies things a bit.
|
304
9-regular-expressions/08-regexp-greedy-and-lazy/article.md
Normal file
|
@ -0,0 +1,304 @@
|
|||
# Greedy and lazy quantifiers
|
||||
|
||||
Quantifiers look very simple at first sight, but in fact they can be tricky.
|
||||
|
||||
We should understand how the search works very well if we plan to look for something more complex than `pattern:/\d+/`.
|
||||
|
||||
Let's take the following task as an example.
|
||||
|
||||
We have a text and need to replace all quotes `"..."` with guillemet marks: `«...»`. They are preferred for typography in many countries.
|
||||
|
||||
For instance: `"Hello, world"` should become `«Hello, world»`. Some countries prefer other quotes, like `„Witam, świat!”` (Polish) or `「你好,世界」` (Chinese), but for our task let's choose `«...»`.
|
||||
|
||||
The first thing to do is to locate quoted strings, and then we can replace them.
|
||||
|
||||
A regular expression like `pattern:/".+"/g` (a quote, then something, then the other quote) may seem like a good fit, but it isn't!
|
||||
|
||||
Let's try it:
|
||||
|
||||
```js run
|
||||
let reg = /".+"/g;
|
||||
|
||||
let str = 'a "witch" and her "broom" is one';
|
||||
|
||||
alert( str.match(reg) ); // "witch" and her "broom"
|
||||
```
|
||||
|
||||
...We can see that it doesn't work as intended!
|
||||
|
||||
Instead of finding two matches `match:"witch"` and `match:"broom"`, it finds one: `match:"witch" and her "broom"`.
|
||||
|
||||
That can be described as "greediness is the cause of all evil".
|
||||
|
||||
## Greedy search
|
||||
|
||||
To find a match, the regular expression engine uses the following algorithm:
|
||||
|
||||
- For every position in the string
|
||||
- Match the pattern at that position.
|
||||
- If there's no match, go to the next position.
|
||||
|
||||
This general description does not make it obvious why the regexp fails, so let's elaborate on how the search works for the pattern `pattern:".+"`.
|
||||
|
||||
1. The first pattern character is a quote `pattern:"`.
|
||||
|
||||
The regular expression engine tries to find it at the zero position of the source string `subject:a "witch" and her "broom" is one`, but there's `subject:a` there, so there's immediately no match.
|
||||
|
||||
Then it advances: goes to the next positions in the source string and tries to find the first character of the pattern there, and finally finds the quote at the 3rd position:
|
||||
|
||||

|
||||
|
||||
2. The quote is detected, and then the engine tries to find a match for the rest of the pattern. It tries to see if the rest of the subject string conforms to `pattern:.+"`.
|
||||
|
||||
In our case the next pattern character is `pattern:.` (a dot). It denotes "any character except a newline", so the next string letter `match:'w'` fits:
|
||||
|
||||

|
||||
|
||||
3. Then the dot repeats because of the quantifier `pattern:.+`. The regular expression engine builds the match by taking characters one by one while it is possible.
|
||||
|
||||
    ...When does it become impossible? All characters match the dot, so it only stops when it reaches the end of the string:
|
||||
|
||||

|
||||
|
||||
4. Now the engine has finished repeating for `pattern:.+` and tries to find the next character of the pattern. It's the quote `pattern:"`. But there's a problem: the string has finished, there are no more characters!
|
||||
|
||||
The regular expression engine understands that it took too many `pattern:.+` and starts to *backtrack*.
|
||||
|
||||
In other words, it shortens the match for the quantifier by one character:
|
||||
|
||||

|
||||
|
||||
Now it assumes that `pattern:.+` ends one character before the end and tries to match the rest of the pattern from that position.
|
||||
|
||||
If there were a quote there, then that would be the end, but the last character is `subject:'e'`, so there's no match.
|
||||
|
||||
5. ...So the engine decreases the number of repetitions of `pattern:.+` by one more character:
|
||||
|
||||

|
||||
|
||||
The quote `pattern:'"'` does not match `subject:'n'`.
|
||||
|
||||
6. The engine keeps backtracking: it decreases the count of repetitions for `pattern:'.'` until the rest of the pattern (in our case `pattern:'"'`) matches:
|
||||
|
||||

|
||||
|
||||
7. The match is complete.
|
||||
|
||||
8. So the first match is `match:"witch" and her "broom"`. The further search starts where the first match ends, but there are no more quotes in the rest of the string `subject:is one`, so no more results.
|
||||
|
||||
That's probably not what we expected, but that's how it works.
|
||||
|
||||
**In the greedy mode (by default) the quantifier is repeated as many times as possible.**
|
||||
|
||||
The regexp engine tries to fetch as many characters as it can by `pattern:.+`, and then shortens that one by one.
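The backtracking steps can be imitated with a few lines of plain code. This is only a simplified sketch for the single pattern `pattern:".+"`, using a hypothetical function `greedyQuoted`; a real engine is far more general, also retries from later start positions, and the dot there does not match newlines.

```js
// Simplified imitation of the greedy search for the pattern ".+" (a sketch, not a real engine)
function greedyQuoted(str) {
  // Step 1: find the opening quote
  let start = str.indexOf('"');
  if (start === -1) return null;

  // Step 3: .+ takes everything up to the end of the string;
  // p is the position right after the characters taken by .+
  let p = str.length;

  // Steps 4-6: backtrack one character at a time
  // (.+ must keep at least one character, so p stays >= start + 2)
  while (p >= start + 2) {
    if (str[p] === '"') return str.slice(start, p + 1); // the closing quote matched
    p--;
  }
  return null;
}

let str = 'a "witch" and her "broom" is one';

console.log( greedyQuoted(str) );                          // "witch" and her "broom"
console.log( greedyQuoted(str) === str.match(/".+"/)[0] ); // true
```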
|
||||
|
||||
For our task we want another thing. That's what the lazy quantifier mode is for.
|
||||
|
||||
## Lazy mode
|
||||
|
||||
The lazy mode of quantifiers is the opposite of the greedy mode. It means: "repeat the minimal number of times".
|
||||
|
||||
We can enable it by putting a question mark `pattern:'?'` after the quantifier, so that it becomes `pattern:*?` or `pattern:+?` or even `pattern:??` for `pattern:'?'`.
|
||||
|
||||
To make things clear: usually a question mark `pattern:?` is a quantifier by itself (zero or one), but if added *after another quantifier (or even itself)* it gets another meaning -- it switches the matching mode from greedy to lazy.
|
||||
|
||||
The regexp `pattern:/".+?"/g` works as intended: it finds `match:"witch"` and `match:"broom"`:
|
||||
|
||||
```js run
|
||||
let reg = /".+?"/g;
|
||||
|
||||
let str = 'a "witch" and her "broom" is one';
|
||||
|
||||
alert( str.match(reg) ); // "witch", "broom"
|
||||
```
|
||||
|
||||
To clearly understand the change, let's trace the search step by step.
|
||||
|
||||
1. The first step is the same: it finds the pattern start `pattern:'"'` at the 3rd position:
|
||||
|
||||

|
||||
|
||||
2. The next step is also similar: the engine finds a match for the dot `pattern:'.'`:
|
||||
|
||||

|
||||
|
||||
3. And now the search goes differently. Because we have a lazy mode for `pattern:+?`, the engine doesn't try to match a dot one more time, but stops and tries to match the rest of the pattern `pattern:'"'` right now:
|
||||
|
||||

|
||||
|
||||
If there were a quote there, then the search would end, but there's `'i'`, so there's no match.
|
||||
4. Then the regular expression engine increases the number of repetitions for the dot and tries one more time:
|
||||
|
||||

|
||||
|
||||
Failure again. Then the number of repetitions is increased again and again...
|
||||
5. ...Till the match for the rest of the pattern is found:
|
||||
|
||||

|
||||
|
||||
6. The next search starts from the end of the current match and yields one more result:
|
||||
|
||||

|
||||
|
||||
In this example we saw how the lazy mode works for `pattern:+?`. Quantifiers `pattern:*?` and `pattern:??` work in a similar way -- the regexp engine increases the number of repetitions only if the rest of the pattern can't match at the given position.
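A few one-line checks of the lazy quantifiers (the strings are made up for illustration). When nothing follows the lazy quantifier, it takes the minimum; when something follows, it is extended only as far as needed:

```js
// Nothing after the quantifier: the minimum is enough
let plusLazy     = "123".match(/\d+?/)[0]; // at least one digit is required
let starLazy     = "123".match(/\d*?/)[0]; // zero digits are enough
let questionLazy = "123".match(/\d??/)[0]; // zero digits are enough

console.log( plusLazy );            // 1
console.log( starLazy === "" );     // true
console.log( questionLazy === "" ); // true

// Something after the quantifier: repetitions are added until the rest matches
let forced = "123 ".match(/\d*? /)[0];
console.log( forced === "123 " );   // true

// Lazy "u??" first tries to skip the "u", and takes it only when it has to
let both = "color colour".match(/colou??r/g);
console.log( both ); // color, colour
```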
|
||||
|
||||
**Laziness is only enabled for the quantifier with `?`.**
|
||||
|
||||
Other quantifiers remain greedy.
|
||||
|
||||
For instance:
|
||||
|
||||
```js run
|
||||
alert( "123 456".match(/\d+ \d+?/g) ); // 123 4
|
||||
```
|
||||
|
||||
1. The pattern `pattern:\d+` tries to match as many digits as it can (greedy mode), so it finds `match:123` and stops, because the next character is a space `pattern:' '`.
|
||||
2. Then there's a space in pattern, it matches.
|
||||
3. Then there's `pattern:\d+?`. The quantifier is in lazy mode, so it finds one digit `match:4` and tries to check if the rest of the pattern matches from there.
|
||||
|
||||
...But there's nothing in the pattern after `pattern:\d+?`.
|
||||
|
||||
The lazy mode doesn't repeat anything without a need. The pattern finished, so we're done. We have a match `match:123 4`.
|
||||
4. The next search starts from the character `5`.
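To see that laziness applies per quantifier, we can compare three variants of the pattern on the same string:

```js
let str = "123 456";

let greedyBoth = str.match(/\d+ \d+/)[0];
let lazySecond = str.match(/\d+ \d+?/)[0];
let lazyBoth   = str.match(/\d+? \d+?/)[0];

console.log( greedyBoth ); // 123 456
console.log( lazySecond ); // 123 4
console.log( lazyBoth );   // 123 4, the first \d+? still has to reach the space
```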
|
||||
|
||||
```smart header="Optimizations"
|
||||
Modern regular expression engines can optimize internal algorithms to work faster. So they may work a bit differently from the described algorithm.
|
||||
|
||||
But to understand how regular expressions work and to build regular expressions, we don't need to know about that. They are only used internally to optimize things.
|
||||
|
||||
Complex regular expressions are hard to optimize, so the search may work exactly as described as well.
|
||||
```
|
||||
|
||||
## Alternative approach
|
||||
|
||||
With regexps, there's often more than one way to do the same thing.
|
||||
|
||||
In our case we can find quoted strings without lazy mode using the regexp `pattern:"[^"]+"`:
|
||||
|
||||
```js run
|
||||
let reg = /"[^"]+"/g;
|
||||
|
||||
let str = 'a "witch" and her "broom" is one';
|
||||
|
||||
alert( str.match(reg) ); // "witch", "broom"
|
||||
```
|
||||
|
||||
The regexp `pattern:"[^"]+"` gives correct results, because it looks for a quote `pattern:'"'` followed by one or more non-quotes `pattern:[^"]`, and then the closing quote.
|
||||
|
||||
When the regexp engine looks for `pattern:[^"]+` it stops the repetitions when it meets the closing quote, and we're done.
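One visible difference between the two approaches, shown on a made-up string: `pattern:[^"]` matches a newline, while the dot (without the `s` flag) does not, so only the second pattern finds a quoted string that spans two lines.

```js
let text = 'say "multi\nline" now';

let lazyDot = text.match(/".+?"/g);
let negated = text.match(/"[^"]+"/g);

console.log( lazyDot );                          // null, the dot stops at the newline
console.log( negated[0] === '"multi\nline"' );   // true
```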
|
||||
|
||||
Please note that this logic does not replace lazy quantifiers!
|
||||
|
||||
It is just different. There are times when we need one or the other.
|
||||
|
||||
**Let's see an example where lazy quantifiers fail and this variant works right.**
|
||||
|
||||
For instance, we want to find links of the form `<a href="..." class="doc">`, with any `href`.
|
||||
|
||||
Which regular expression to use?
|
||||
|
||||
The first idea might be: `pattern:/<a href=".*" class="doc">/g`.
|
||||
|
||||
Let's check it:
|
||||
```js run
|
||||
let str = '...<a href="link" class="doc">...';
|
||||
let reg = /<a href=".*" class="doc">/g;
|
||||
|
||||
// Works!
|
||||
alert( str.match(reg) ); // <a href="link" class="doc">
|
||||
```
|
||||
|
||||
It worked. But let's see what happens if there are many links in the text:
|
||||
|
||||
```js run
|
||||
let str = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';
|
||||
let reg = /<a href=".*" class="doc">/g;
|
||||
|
||||
// Whoops! Two links in one match!
|
||||
alert( str.match(reg) ); // <a href="link1" class="doc">... <a href="link2" class="doc">
|
||||
```
|
||||
|
||||
Now the result is wrong for the same reason as our "witches" example. The quantifier `pattern:.*` took too many characters.
|
||||
|
||||
The match looks like this:
|
||||
|
||||
```html
|
||||
<a href="....................................." class="doc">
|
||||
<a href="link1" class="doc">... <a href="link2" class="doc">
|
||||
```
|
||||
|
||||
Let's modify the pattern by making the quantifier `pattern:.*?` lazy:
|
||||
|
||||
```js run
|
||||
let str = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';
|
||||
let reg = /<a href=".*?" class="doc">/g;
|
||||
|
||||
// Works!
|
||||
alert( str.match(reg) ); // <a href="link1" class="doc">, <a href="link2" class="doc">
|
||||
```
|
||||
|
||||
Now it seems to work, there are two matches:
|
||||
|
||||
```html
|
||||
<a href="....." class="doc"> <a href="....." class="doc">
|
||||
<a href="link1" class="doc">... <a href="link2" class="doc">
|
||||
```
|
||||
|
||||
...But let's test it on one more text input:
|
||||
|
||||
```js run
|
||||
let str = '...<a href="link1" class="wrong">... <p style="" class="doc">...';
|
||||
let reg = /<a href=".*?" class="doc">/g;
|
||||
|
||||
// Wrong match!
|
||||
alert( str.match(reg) ); // <a href="link1" class="wrong">... <p style="" class="doc">
|
||||
```
|
||||
|
||||
Now it fails. The match includes not just a link, but also a lot of text after it, including `<p...>`.
|
||||
|
||||
Why?
|
||||
|
||||
That's what's going on:
|
||||
|
||||
1. First the regexp finds a link start `match:<a href="`.
|
||||
2. Then it looks for `pattern:.*?`: takes one character (lazily!), checks if there's a match for `pattern:" class="doc">` (none).
|
||||
3. Then takes another character into `pattern:.*?`, and so on... until it finally reaches `match:" class="doc">`.
|
||||
|
||||
But the problem is: that's already beyond the link, in another tag `<p>`. Not what we want.
|
||||
|
||||
Here's the picture of the match aligned with the text:
|
||||
|
||||
```html
|
||||
<a href="..................................." class="doc">
|
||||
<a href="link1" class="wrong">... <p style="" class="doc">
|
||||
```
|
||||
|
||||
So the laziness did not work for us here.
|
||||
|
||||
We need the pattern to look for `<a href="...something..." class="doc">`, but both greedy and lazy variants have problems.
|
||||
|
||||
The correct variant would be: `pattern:href="[^"]*"`. It will take all characters inside the `href` attribute till the nearest quote, just what we need.
|
||||
|
||||
A working example:
|
||||
|
||||
```js run
|
||||
let str1 = '...<a href="link1" class="wrong">... <p style="" class="doc">...';
|
||||
let str2 = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';
|
||||
let reg = /<a href="[^"]*" class="doc">/g;
|
||||
|
||||
// Works!
|
||||
alert( str1.match(reg) ); // null, no matches, that's correct
|
||||
alert( str2.match(reg) ); // <a href="link1" class="doc">, <a href="link2" class="doc">
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
Quantifiers have two modes of work:
|
||||
|
||||
Greedy
|
||||
: By default the regular expression engine tries to repeat the quantifier as many times as possible. For instance, `pattern:\d+` consumes all possible digits. When it becomes impossible to consume more (no more digits or string end), then it continues to match the rest of the pattern. If there's no match then it decreases the number of repetitions (backtracks) and tries again.
|
||||
|
||||
Lazy
|
||||
: Enabled by the question mark `pattern:?` after the quantifier. The regexp engine tries to match the rest of the pattern before each repetition of the quantifier.
|
||||
|
||||
As we've seen, the lazy mode is not a "panacea" for the greedy search. An alternative is a "fine-tuned" greedy search, with exclusions. Soon we'll see more examples of it.
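To put the three variants from this chapter side by side, here is a minimal sketch that counts the matches of each pattern on the two link examples. The helper `countMatches` and the `variants` object are illustrative names, not part of any API. The correct answer is 0 links for the first string and 2 for the second.

```js
let variants = {
  greedy: /<a href=".*" class="doc">/g,
  lazy: /<a href=".*?" class="doc">/g,
  exclusion: /<a href="[^"]*" class="doc">/g
};

let str1 = '...<a href="link1" class="wrong">... <p style="" class="doc">...';
let str2 = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';

// Hypothetical helper: number of matches, 0 if there are none
function countMatches(str, regexp) {
  return (str.match(regexp) || []).length;
}

let counts = {};
for (let [name, regexp] of Object.entries(variants)) {
  counts[name] = [countMatches(str1, regexp), countMatches(str2, regexp)];
}

console.log(counts.greedy);    // [ 1, 1 ]  both wrong
console.log(counts.lazy);      // [ 1, 2 ]  wrong match in the first string
console.log(counts.exclusion); // [ 0, 2 ]  correct
```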
|
BIN
9-regular-expressions/08-regexp-greedy-and-lazy/witch_lazy3.png
Normal file
BIN
9-regular-expressions/08-regexp-greedy-and-lazy/witch_lazy4.png
Normal file
BIN
9-regular-expressions/08-regexp-greedy-and-lazy/witch_lazy5.png
Normal file
BIN
9-regular-expressions/08-regexp-greedy-and-lazy/witch_lazy6.png
Normal file
|
@ -0,0 +1,29 @@
|
|||
A regexp to search for a 3-digit color `#abc`: `pattern:/#[a-f0-9]{3}/i`.
|
||||
|
||||
We can add exactly 3 more optional hex digits. We don't need more or less. Either we have them or we don't.
|
||||
|
||||
The simplest way to add them is to append them to the regexp: `pattern:/#[a-f0-9]{3}([a-f0-9]{3})?/i`
|
||||
|
||||
We can do it in a smarter way though: `pattern:/#([a-f0-9]{3}){1,2}/i`.
|
||||
|
||||
Here the regexp `pattern:[a-f0-9]{3}` is in parentheses to apply the quantifier `pattern:{1,2}` to it as a whole.
|
||||
|
||||
In action:
|
||||
|
||||
```js run
|
||||
let reg = /#([a-f0-9]{3}){1,2}/gi;
|
||||
|
||||
let str = "color: #3f3; background-color: #AA00ef; and: #abcd";
|
||||
|
||||
alert( str.match(reg) ); // #3f3 #AA00ef #abc
|
||||
```
|
||||
|
||||
There's a minor problem here: the pattern found `match:#abc` in `subject:#abcd`. To prevent that we can add `pattern:\b` to the end:
|
||||
|
||||
```js run
|
||||
let reg = /#([a-f0-9]{3}){1,2}\b/gi;
|
||||
|
||||
let str = "color: #3f3; background-color: #AA00ef; and: #abcd";
|
||||
|
||||
alert( str.match(reg) ); // #3f3 #AA00ef
|
||||
```
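A few more inputs help to confirm that lengths other than 3 or 6 are rejected. The sketch below wraps the pattern in a small helper; `findColors` is a hypothetical name, and it returns an empty array instead of `null` when nothing matches.

```js
// Hypothetical helper around the solution pattern
function findColors(str) {
  let reg = /#([a-f0-9]{3}){1,2}\b/gi;
  return str.match(reg) || [];
}

let found = findColors("color: #3f3; background-color: #AA00ef; and: #abcd");
let rejected = findColors("#12345 #abcdefg");

console.log(found);    // [ '#3f3', '#AA00ef' ]
console.log(rejected); // []
```

In `#12345` and `#abcdefg` the character right after a 3-digit or 6-digit prefix is still a word character, so `pattern:\b` does not match there.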
|
|
@ -0,0 +1,14 @@
|
|||
# Find color in the format #abc or #abcdef
|
||||
|
||||
Write a RegExp that matches colors in the format `#abc` or `#abcdef`. That is: `#` followed by 3 or 6 hexadecimal digits.
|
||||
|
||||
Usage example:
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
|
||||
let str = "color: #3f3; background-color: #AA00ef; and: #abcd";
|
||||
|
||||
alert( str.match(reg) ); // #3f3 #AA00ef
|
||||
```
|
||||
|
||||
P.S. This should be exactly 3 or 6 hex digits: values like `#abcd` should not match.
|
|
@ -0,0 +1,18 @@
|
|||
|
||||
A non-negative integer number is `pattern:\d+`. We should exclude `0` as the first digit, because we don't need zero, but we can allow it in the following digits.
|
||||
|
||||
So that gives us `pattern:[1-9]\d*`.
|
||||
|
||||
A decimal part is: `pattern:\.\d+`.
|
||||
|
||||
Because the decimal part is optional, let's put it in parentheses with the quantifier `pattern:'?'`.
|
||||
|
||||
Finally we have the regexp: `pattern:[1-9]\d*(\.\d+)?`:
|
||||
|
||||
```js run
|
||||
let reg = /[1-9]\d*(\.\d+)?/g;
|
||||
|
||||
let str = "1.5 0 -5 12. 123.4.";
|
||||
|
||||
alert( str.match(reg) ); // 1.5, 5, 12, 123.4 (0 is skipped, but the 5 from "-5" is still found)
|
||||
```
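Note that the pattern `pattern:[1-9]\d*(\.\d+)?` by itself does not know about the minus sign, so it finds `match:5` inside `subject:-5`. One possible way to skip such digits is a negative lookbehind, which is not covered in this chapter and is assumed here to be supported by the engine (it is part of modern JavaScript). The sketch below is only one option, and the variable names are illustrative.

```js
let str = "1.5 0 -5 12. 123.4.";

// The pattern from the solution: also finds the 5 that belongs to -5
let plain = str.match(/[1-9]\d*(\.\d+)?/g);

// Assumption: lookbehind support. (?<![-\d.]) requires that the number
// is not preceded by a minus sign, a digit or a dot.
let strict = str.match(/(?<![-\d.])[1-9]\d*(\.\d+)?/g);

console.log(plain);  // [ '1.5', '5', '12', '123.4' ]
console.log(strict); // [ '1.5', '12', '123.4' ]
```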
|
|
@ -0,0 +1,12 @@
|
|||
# Find positive numbers
|
||||
|
||||
Create a regexp that looks for positive numbers, including those without a decimal point.
|
||||
|
||||
An example of use:
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
|
||||
let str = "1.5 0 -5 12. 123.4.";
|
||||
|
||||
alert( str.match(reg) ); // 1.5, 12, 123.4 (ignores 0 and -5)
|
||||
```
|
|
@ -0,0 +1,11 @@
|
|||
A positive number with an optional decimal part is (per previous task): `pattern:\d+(\.\d+)?`.
|
||||
|
||||
Let's add an optional `-` in the beginning:
|
||||
|
||||
```js run
|
||||
let reg = /-?\d+(\.\d+)?/g;
|
||||
|
||||
let str = "-1.5 0 2 -123.4.";
|
||||
|
||||
alert( str.match(reg) ); // -1.5, 0, 2, -123.4
|
||||
```
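The matches are strings. If actual numbers are needed, they can be converted afterwards, for instance with the built-in `Number` function. A minimal sketch with illustrative variable names:

```js
let reg = /-?\d+(\.\d+)?/g;
let str = "-1.5 0 2 -123.4.";

// match() returns strings; map(Number) converts each of them to a number
let numbers = str.match(reg).map(Number);

console.log(numbers); // [ -1.5, 0, 2, -123.4 ]
```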
|
|
@ -0,0 +1,13 @@
|
|||
# Find all numbers
|
||||
|
||||
Write a regexp that looks for all decimal numbers, including integer ones, floating-point ones and negative ones.
|
||||
|
||||
An example of use:
|
||||
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
|
||||
let str = "-1.5 0 2 -123.4.";
|
||||
|
||||
alert( str.match(reg) ); // -1.5, 0, 2, -123.4
|
||||
```
|
|
@ -0,0 +1,51 @@
|
|||
A regexp for a number is: `pattern:-?\d+(\.\d+)?`. We created it in previous tasks.
|
||||
|
||||
An operator is `pattern:[-+*/]`. We put the dash `pattern:-` first, because in the middle it would mean a character range, which we don't need.
|
||||
|
||||
Note that a slash should be escaped inside a JavaScript regexp `pattern:/.../`.
|
||||
|
||||
We need a number, an operator, and then another number. And optional spaces between them.
|
||||
|
||||
The full regular expression: `pattern:-?\d+(\.\d+)?\s*[-+*/]\s*-?\d+(\.\d+)?`.
|
||||
|
||||
To get a result as an array let's put parentheses around the data that we need: numbers and the operator: `pattern:(-?\d+(\.\d+)?)\s*([-+*/])\s*(-?\d+(\.\d+)?)`.
|
||||
|
||||
In action:
|
||||
|
||||
```js run
|
||||
let reg = /(-?\d+(\.\d+)?)\s*([-+*\/])\s*(-?\d+(\.\d+)?)/;
|
||||
|
||||
alert( "1.2 + 12".match(reg) );
|
||||
```
|
||||
|
||||
The result includes:
|
||||
|
||||
- `result[0] == "1.2 + 12"` (full match)
|
||||
- `result[1] == "1.2"` (first group `(-?\d+(\.\d+)?)` -- the first number, including the decimal part)
|
||||
- `result[2] == ".2"` (second group `(\.\d+)?` -- the first decimal part)
|
||||
- `result[3] == "+"` (third group `([-+*\/])` -- the operator)
|
||||
- `result[4] == "12"` (fourth group `(-?\d+(\.\d+)?)` -- the second number)
|
||||
- `result[5] == undefined` (fifth group `(\.\d+)?` -- the last decimal part is absent, so it's undefined)
|
||||
|
||||
We only want the numbers and the operator, without the full match or the decimal parts.
|
||||
|
||||
The full match (the array's first item) can be removed by shifting the array: `pattern:result.shift()`.
|
||||
|
||||
The decimal groups can be removed by making them into non-capturing groups, by adding `pattern:?:` to the beginning: `pattern:(?:\.\d+)?`.
|
||||
|
||||
The final solution:
|
||||
|
||||
```js run
|
||||
function parse(expr) {
|
||||
let reg = /(-?\d+(?:\.\d+)?)\s*([-+*\/])\s*(-?\d+(?:\.\d+)?)/;
|
||||
|
||||
let result = expr.match(reg);
|
||||
|
||||
if (!result) return [];
|
||||
result.shift();
|
||||
|
||||
return result;
|
||||
}
|
||||
|
||||
alert( parse("-1.23 * 3.45") ); // -1.23, *, 3.45
|
||||
```
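As a usage example, the parsed parts can be fed into a tiny calculator. This is only a sketch: `calculate` is a hypothetical helper that is not part of the task, and `parse` is repeated from the solution so that the snippet is self-contained.

```js
// Same parse() as in the solution
function parse(expr) {
  let reg = /(-?\d+(?:\.\d+)?)\s*([-+*\/])\s*(-?\d+(?:\.\d+)?)/;

  let result = expr.match(reg);

  if (!result) return [];
  result.shift();

  return result;
}

// Hypothetical helper: evaluates "number operator number"
function calculate(expr) {
  let [a, op, b] = parse(expr);

  if (op === undefined) return NaN; // the expression did not match

  a = Number(a);
  b = Number(b);

  switch (op) {
    case "+": return a + b;
    case "-": return a - b;
    case "*": return a * b;
    case "/": return a / b;
  }
}

console.log( calculate("1 + 2") );     // 3
console.log( calculate(" -3 / -6 ") ); // 0.5
console.log( calculate("-2 - 2") );    // -4
console.log( calculate("hello") );     // NaN
```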
|
|
@ -0,0 +1,28 @@
|
|||
# Parse an expression
|
||||
|
||||
An arithmetical expression consists of 2 numbers and an operator between them, for instance:
|
||||
|
||||
- `1 + 2`
|
||||
- `1.2 * 3.4`
|
||||
- `-3 / -6`
|
||||
- `-2 - 2`
|
||||
|
||||
The operator is one of: `"+"`, `"-"`, `"*"` or `"/"`.
|
||||
|
||||
There may be extra spaces at the beginning, at the end or between the parts.
|
||||
|
||||
Create a function `parse(expr)` that takes an expression and returns an array of 3 items:
|
||||
|
||||
1. The first number.
|
||||
2. The operator.
|
||||
3. The second number.
|
||||
|
||||
For example:
|
||||
|
||||
```js
|
||||
let [a, op, b] = parse("1.2 * 3.4");
|
||||
|
||||
alert(a); // 1.2
|
||||
alert(op); // *
|
||||
alert(b); // 3.4
|
||||
```
|
237
9-regular-expressions/09-regexp-groups/article.md
Normal file
|
@ -0,0 +1,237 @@
|
|||
# Capturing groups
|
||||
|
||||
A part of a pattern can be enclosed in parentheses `pattern:(...)`. This is called a "capturing group".
|
||||
|
||||
That has two effects:
|
||||
|
||||
1. It allows us to get a part of the match as a separate item in the result array.
|
||||
2. If we put a quantifier after the parentheses, it applies to the parentheses as a whole, not the last character.
|
||||
|
||||
## Example
|
||||
|
||||
In the example below the pattern `pattern:(go)+` finds one or more `match:'go'`:
|
||||
|
||||
```js run
|
||||
alert( 'Gogogo now!'.match(/(go)+/ig) ); // "Gogogo"
|
||||
```
|
||||
|
||||
Without parentheses, the pattern `pattern:/go+/` means `subject:g`, followed by `subject:o` repeated one or more times. For instance, `match:goooo` or `match:gooooooooo`.
|
||||
|
||||
Parentheses group the word `pattern:(go)` together.
|
||||
|
||||
Let's make something more complex -- a regexp to match an email.
|
||||
|
||||
Examples of emails:
|
||||
|
||||
```
|
||||
my@mail.com
|
||||
john.smith@site.com.uk
|
||||
```
|
||||
|
||||
The pattern: `pattern:[-.\w]+@([\w-]+\.)+[\w-]{2,20}`.
|
||||
|
||||
1. The first part `pattern:[-.\w]+` (before `@`) may include any alphanumeric word characters, a dot and a dash, to match `match:john.smith`.
|
||||
2. Then `pattern:@`, and the domain. It may be a subdomain like `host.site.com.uk`, so we match it as "a word followed by a dot" `pattern:([\w-]+\.)` (repeated), and then the last part must be a word: `match:com` or `match:uk` (but not very long: 2-20 characters).
|
||||
|
||||
That regexp is not perfect, but good enough to fix errors or occasional mistypes.
|
||||
|
||||
For instance, we can find all emails in the string:
|
||||
|
||||
```js run
|
||||
let reg = /[-.\w]+@([\w-]+\.)+[\w-]{2,20}/g;
|
||||
|
||||
alert("my@mail.com @ his@site.com.uk".match(reg)); // my@mail.com, his@site.com.uk
|
||||
```
|
||||
|
||||
In this example parentheses were used to make a group for repeating `pattern:(...)+`. But there are other uses too, let's see them.
|
||||
|
||||
## Contents of parentheses
|
||||
|
||||
Parentheses are numbered from left to right. The search engine remembers the content matched by each of them and allows us to reference it in the pattern or in the replacement string.
|
||||
|
||||
For instance, we'd like to find HTML tags `pattern:<.*?>`, and process them.
|
||||
|
||||
Let's wrap the inner content into parentheses, like this: `pattern:<(.*?)>`.
|
||||
|
||||
We'll get both the tag as a whole and its content in the result array:
|
||||
|
||||
```js run
|
||||
let str = '<h1>Hello, world!</h1>';
|
||||
let reg = /<(.*?)>/;
|
||||
|
||||
alert( str.match(reg) ); // Array: ["<h1>", "h1"]
|
||||
```
|
||||
|
||||
The call to [String#match](mdn:js/String/match) returns groups only if the regexp has no `pattern:/.../g` flag.
|
||||
|
||||
If we need all matches with their groups then we can use `.matchAll` or `regexp.exec` as described in <info:regexp-methods>:
|
||||
|
||||
```js run
|
||||
let str = '<h1>Hello, world!</h1>';
|
||||
|
||||
// two matches: opening <h1> and closing </h1> tags
|
||||
let reg = /<(.*?)>/g;
|
||||
|
||||
let matches = Array.from( str.matchAll(reg) );
|
||||
|
||||
alert(matches[0]); // Array: ["<h1>", "h1"]
|
||||
alert(matches[1]); // Array: ["</h1>", "/h1"]
|
||||
```
|
||||
|
||||
Here we have two matches for `pattern:<(.*?)>`, each of them is an array with the full match and groups.
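Because `matchAll` returns an iterable, the matches can also be processed in a loop, with destructuring to name the full match and the group. A minimal sketch; the `tags` array and the property names are illustrative.

```js
let str = '<h1>Hello, world!</h1>';
let reg = /<(.*?)>/g;

let tags = [];

// Each item is an array: [full match, group 1]
for (let [full, inner] of str.matchAll(reg)) {
  tags.push({ full, inner });
}

console.log(tags[0]); // { full: '<h1>', inner: 'h1' }
console.log(tags[1]); // { full: '</h1>', inner: '/h1' }
```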
|
||||
|
||||
## Nested groups
|
||||
|
||||
Parentheses can be nested. In this case the numbering also goes from left to right.
|
||||
|
||||
For instance, when searching a tag in `subject:<span class="my">` we may be interested in:
|
||||
|
||||
1. The tag content as a whole: `match:span class="my"`.
|
||||
2. The tag name: `match:span`.
|
||||
3. The tag attributes: `match:class="my"`.
|
||||
|
||||
Let's add parentheses for them:
|
||||
|
||||
```js run
|
||||
let str = '<span class="my">';
|
||||
|
||||
let reg = /<(([a-z]+)\s*([^>]*))>/;
|
||||
|
||||
let result = str.match(reg);
|
||||
alert(result); // <span class="my">, span class="my", span, class="my"
|
||||
```
|
||||
|
||||
Here's how groups look:
|
||||
|
||||

|
||||
|
||||
The zero index of `result` always holds the full match.
|
||||
|
||||
Then groups, numbered from left to right. Whichever opens first gives the first group `result[1]`. Here it encloses the whole tag content.
|
||||
|
||||
Then in `result[2]` goes the group from the second opening `pattern:(` till the corresponding `pattern:)` -- tag name, then we don't group spaces, but group attributes for `result[3]`.
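The numbering described above maps directly onto array destructuring, which gives each group a readable name. A small sketch with illustrative variable names:

```js
let str = '<span class="my">';
let reg = /<(([a-z]+)\s*([^>]*))>/;

// Order: full match, group 1 (whole content), group 2 (name), group 3 (attributes)
let [full, content, name, attrs] = str.match(reg);

console.log(full);    // <span class="my">
console.log(content); // span class="my"
console.log(name);    // span
console.log(attrs);   // class="my"
```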
|
||||
|
||||
**If a group is optional and doesn't exist in the match, the corresponding `result` index is present (and equals `undefined`).**
|
||||
|
||||
For instance, let's consider the regexp `pattern:a(z)?(c)?`. It looks for `"a"` optionally followed by `"z"` optionally followed by `"c"`.
|
||||
|
||||
If we run it on the string with a single letter `subject:a`, then the result is:
|
||||
|
||||
```js run
|
||||
let match = 'a'.match(/a(z)?(c)?/);
|
||||
|
||||
alert( match.length ); // 3
|
||||
alert( match[0] ); // a (whole match)
|
||||
alert( match[1] ); // undefined
|
||||
alert( match[2] ); // undefined
|
||||
```
|
||||
|
||||
The array has the length of `3`, but all groups are empty.
|
||||
|
||||
And here's a more complex match for the string `subject:ack`:
|
||||
|
||||
```js run
|
||||
let match = 'ack'.match(/a(z)?(c)?/);
|
||||
|
||||
alert( match.length ); // 3
|
||||
alert( match[0] ); // ac (whole match)
|
||||
alert( match[1] ); // undefined, because there's nothing for (z)?
|
||||
alert( match[2] ); // c
|
||||
```
|
||||
|
||||
The array length is permanent: `3`. But there's nothing for the group `pattern:(z)?`, so the result is `["ac", undefined, "c"]`.
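Since absent groups are `undefined`, default values in destructuring are one convenient way to handle them. The following is a sketch; the placeholder strings are arbitrary.

```js
// Defaults are used only for items that are undefined
let [full, z = "(no z)", c = "(no c)"] = "ack".match(/a(z)?(c)?/);

console.log(full); // ac
console.log(z);    // (no z)
console.log(c);    // c
```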
|
||||
|
||||
## Named groups
|
||||
|
||||
Remembering groups by their numbers is hard. For simple patterns it's doable, but for more complex ones we can give names to parentheses.
|
||||
|
||||
That's done by putting `pattern:?<name>` immediately after the opening paren, like this:
|
||||
|
||||
```js run
|
||||
*!*
|
||||
let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;
|
||||
*/!*
|
||||
let str = "2019-04-30";
|
||||
|
||||
let groups = str.match(dateRegexp).groups;
|
||||
|
||||
alert(groups.year); // 2019
|
||||
alert(groups.month); // 04
|
||||
alert(groups.day); // 30
|
||||
```
|
||||
|
||||
As you can see, the groups reside in the `.groups` property of the match.
|
||||
|
||||
We can also use them in replacements, as `pattern:$<name>` (like `$1..9`, but with a name instead of a digit).
|
||||
|
||||
For instance, let's rearrange the date into `day.month.year`:
|
||||
|
||||
```js run
|
||||
let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;
|
||||
|
||||
let str = "2019-04-30";
|
||||
|
||||
let rearranged = str.replace(dateRegexp, '$<day>.$<month>.$<year>');
|
||||
|
||||
alert(rearranged); // 30.04.2019
|
||||
```
|
||||
|
||||
If we use a function, then the named `groups` object is always the last argument:
|
||||
|
||||
```js run
|
||||
let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;
|
||||
|
||||
let str = "2019-04-30";
|
||||
|
||||
let rearranged = str.replace(dateRegexp,
|
||||
(str, year, month, day, offset, input, groups) =>
|
||||
`${groups.day}.${groups.month}.${groups.year}`
|
||||
);
|
||||
|
||||
alert(rearranged); // 30.04.2019
|
||||
```
|
||||
|
||||
Usually, when we intend to use named groups, we don't need positional arguments of the function. For the majority of real-life cases we only need `str` and `groups`.
|
||||
|
||||
So we can write it a little bit shorter:
|
||||
|
||||
```js
|
||||
let rearranged = str.replace(dateRegexp, (str, ...args) => {
|
||||
let {year, month, day} = args.pop();
|
||||
alert(str); // 2019-04-30
|
||||
alert(year); // 2019
|
||||
alert(month); // 04
|
||||
alert(day); // 30
|
||||
});
|
||||
```
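Named groups are also available on every match returned by `matchAll`. The sketch below collects all dates from a text; the sample text and the `dates` array are made up for illustration.

```js
// Note the g flag: matchAll requires it
let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/g;

let str = "Released on 2019-04-30, updated on 2020-01-15";

let dates = [];

for (let match of str.matchAll(dateRegexp)) {
  let {year, month, day} = match.groups;
  dates.push(`${day}.${month}.${year}`);
}

console.log(dates); // [ '30.04.2019', '15.01.2020' ]
```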
|
||||
|
||||
|
||||
## Non-capturing groups with ?:
|
||||
|
||||
Sometimes we need parentheses to correctly apply a quantifier, but we don't want the contents in results.
|
||||
|
||||
A group may be excluded by adding `pattern:?:` in the beginning.
|
||||
|
||||
For instance, if we want to find `pattern:(go)+`, but don't want to remember the contents (`go`) in a separate array item, we can write: `pattern:(?:go)+`.
|
||||
|
||||
In the example below we only get the name "John" as a separate member of the `result` array:
|
||||
|
||||
```js run
|
||||
let str = "Gogo John!";
|
||||
*!*
|
||||
// exclude Gogo from capturing
|
||||
let reg = /(?:go)+ (\w+)/i;
|
||||
*/!*
|
||||
|
||||
let result = str.match(reg);
|
||||
|
||||
alert( result.length ); // 2
|
||||
alert( result[1] ); // John
|
||||
```
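For comparison, here is a sketch that runs the same search with a capturing and with a non-capturing group. The variable names are illustrative.

```js
let str = "Gogo John!";

// Capturing: the last repetition of (go) takes a slot in the result
let capturing = str.match(/(go)+ (\w+)/i);

// Non-capturing: only (\w+) is remembered
let nonCapturing = str.match(/(?:go)+ (\w+)/i);

console.log(capturing.length);    // 3
console.log(capturing[1]);        // go
console.log(nonCapturing.length); // 2
console.log(nonCapturing[1]);     // John
```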
|
||||
|
||||
## Summary
|
||||
|
||||
- Parentheses can be:
|
||||
- capturing `(...)`, ordered left-to-right, accessible by number.
|
||||
- named capturing `(?<name>...)`, accessible by name.
|
||||
  - non-capturing `(?:...)`, used only to apply a quantifier to the whole group.
|
BIN
9-regular-expressions/09-regexp-groups/regexp-nested-groups.png
Normal file
65
9-regular-expressions/10-regexp-backreferences/article.md
Normal file
|
@ -0,0 +1,65 @@
|
|||
# Backreferences in pattern: \n and \k
|
||||
|
||||
Capturing groups can be accessed not only in the result or in the replacement string, but also in the pattern itself.
|
||||
|
||||
## Backreference by number: \n
|
||||
|
||||
A group can be referenced in the pattern using `\n`, where `n` is the group number.
|
||||
|
||||
To make things clear let's consider a task.
|
||||
|
||||
We need to find a quoted string: either a single-quoted `subject:'...'` or a double-quoted `subject:"..."` -- both variants need to match.
|
||||
|
||||
How to look for them?
|
||||
|
||||
We can put two kinds of quotes in the pattern: `pattern:['"](.*?)['"]`, but it would find strings with mixed quotes, like `match:"...'` and `match:'..."`. That would lead to incorrect matches when one quote appears inside other ones, like the string `subject:"She's the one!"`:
|
||||
|
||||
```js run
|
||||
let str = `He said: "She's the one!".`;
|
||||
|
||||
let reg = /['"](.*?)['"]/g;
|
||||
|
||||
// The result is not what we expect
|
||||
alert( str.match(reg) ); // "She'
|
||||
```
|
||||
|
||||
As we can see, the pattern found an opening quote `match:"`, then the text is consumed lazily till the other quote `match:'`, that closes the match.
|
||||
|
||||
To make sure that the pattern looks for a closing quote exactly the same as the opening one, we can wrap it into a capturing group and use a backreference.
|
||||
|
||||
Here's the correct code:
|
||||
|
||||
```js run
|
||||
let str = `He said: "She's the one!".`;
|
||||
|
||||
*!*
|
||||
let reg = /(['"])(.*?)\1/g;
|
||||
*/!*
|
||||
|
||||
alert( str.match(reg) ); // "She's the one!"
|
||||
```
|
||||
|
||||
Now it works! The regular expression engine finds the first quote `pattern:(['"])` and remembers the content of `pattern:(...)`, that's the first capturing group.
|
||||
|
||||
Further in the pattern `pattern:\1` means "find the same text as in the first group", exactly the same quote in our case.
|
||||
|
||||
Please note:
|
||||
|
||||
- To reference a group inside a replacement string -- we use `$1`, while in the pattern -- a backslash `\1`.
|
||||
- If we use `?:` in the group, then we can't reference it. Groups that are excluded from capturing `(?:...)` are not remembered by the engine.
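Both notations can appear in one call: `\1` inside the pattern and `$2` inside the replacement string. The sketch below replaces the quotes around the text with square brackets; the choice of brackets is arbitrary.

```js
let str = `He said: "She's the one!".`;

// \1 in the pattern: the closing quote must be the same as the opening one
// $2 in the replacement: the text captured by the second group
let result = str.replace(/(['"])(.*?)\1/g, "[$2]");

console.log(result); // He said: [She's the one!].
```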
|
||||
|
||||
## Backreference by name: `\k<name>`
|
||||
|
||||
For named groups, we can backreference by `\k<name>`.
|
||||
|
||||
The same example with the named group:
|
||||
|
||||
```js run
|
||||
let str = `He said: "She's the one!".`;
|
||||
|
||||
*!*
|
||||
let reg = /(?<quote>['"])(.*?)\k<quote>/g;
|
||||
*/!*
|
||||
|
||||
alert( str.match(reg) ); // "She's the one!"
|
||||
```
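If the quoted text itself is needed, the second group can be named as well and read from `.groups`. A sketch; the group name `text` is an illustrative choice.

```js
let str = `He said: "She's the one!".`;

// No g flag here, so match() returns the groups
let reg = /(?<quote>['"])(?<text>.*?)\k<quote>/;

let match = str.match(reg);

console.log(match.groups.quote); // "
console.log(match.groups.text);  // She's the one!
```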
|
|
@ -0,0 +1,33 @@
|
|||
|
||||
The first idea can be to list the languages with `|` in-between.
|
||||
|
||||
But that doesn't work right:
|
||||
|
||||
```js run
|
||||
let reg = /Java|JavaScript|PHP|C|C\+\+/g;
|
||||
|
||||
let str = "Java, JavaScript, PHP, C, C++";
|
||||
|
||||
alert( str.match(reg) ); // Java,Java,PHP,C,C
|
||||
```
|
||||
|
||||
The regular expression engine looks for alternations one-by-one. That is: first it checks if we have `match:Java`, otherwise -- looks for `match:JavaScript` and so on.
|
||||
|
||||
As a result, `match:JavaScript` can never be found, just because `match:Java` is checked first.
|
||||
|
||||
The same with `match:C` and `match:C++`.
|
||||
|
||||
There are two solutions for that problem:
|
||||
|
||||
1. Change the order to check the longer match first: `pattern:JavaScript|Java|C\+\+|C|PHP`.
|
||||
2. Merge variants with the same start: `pattern:Java(Script)?|C(\+\+)?|PHP`.
|
||||
|
||||
In action:
|
||||
|
||||
```js run
|
||||
let reg = /Java(Script)?|C(\+\+)?|PHP/g;
|
||||
|
||||
let str = "Java, JavaScript, PHP, C, C++";
|
||||
|
||||
alert( str.match(reg) ); // Java,JavaScript,PHP,C,C++
|
||||
```
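The first solution, reordering the alternatives so that the longer ones are checked first, can be verified the same way. A sketch with illustrative variable names:

```js
// Longer alternatives go before their shorter prefixes
let reordered = /JavaScript|Java|C\+\+|C|PHP/g;

let languages = "Java, JavaScript, PHP, C, C++".match(reordered);

console.log(languages); // [ 'Java', 'JavaScript', 'PHP', 'C', 'C++' ]
```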
|
|
@ -0,0 +1,11 @@
|
|||
# Find programming languages
|
||||
|
||||
There are many programming languages, for instance Java, JavaScript, PHP, C, C++.
|
||||
|
||||
Create a regexp that finds them in the string `subject:Java JavaScript PHP C++ C`:
|
||||
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
|
||||
alert("Java JavaScript PHP C++ C".match(reg)); // Java JavaScript PHP C++ C
|
||||
```
|
|
@ -0,0 +1,23 @@
|
|||
|
||||
Opening tag is `pattern:\[(b|url|quote)\]`.
|
||||
|
||||
Then to find everything till the closing tag, let's use the pattern `pattern:[\s\S]*?` to match any character including the newline, and then a backreference for the closing tag.
|
||||
|
||||
The full pattern: `pattern:\[(b|url|quote)\][\s\S]*?\[/\1\]`.
|
||||
|
||||
In action:
|
||||
|
||||
```js run
|
||||
let reg = /\[(b|url|quote)\][\s\S]*?\[\/\1\]/g;
|
||||
|
||||
let str = `
|
||||
[b]hello![/b]
|
||||
[quote]
|
||||
[url]http://google.com[/url]
|
||||
[/quote]
|
||||
`;
|
||||
|
||||
alert( str.match(reg) ); // [b]hello![/b],[quote][url]http://google.com[/url][/quote]
|
||||
```
|
||||
|
||||
Please note that we had to escape a slash for the closing tag `pattern:[/\1]`, because normally the slash closes the pattern.
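To see which tag each match belongs to, the first group can be read from the results of `matchAll`. The sketch below uses a made-up input with one nested tag; `found` is an illustrative name.

```js
let reg = /\[(b|url|quote)\][\s\S]*?\[\/\1\]/g;

let str = "..[url][b]http://google.com[/b][/url].. [quote]text[/quote]";

let found = [];

for (let match of str.matchAll(reg)) {
  // match[1] is the tag name, match[0] is the tag with its contents
  found.push(`${match[1]} => ${match[0]}`);
}

console.log(found[0]); // url => [url][b]http://google.com[/b][/url]
console.log(found[1]); // quote => [quote]text[/quote]
```

The nested `[b]...[/b]` is not reported separately, because the search continues after the end of the outer `[url]` match.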
|
|
@ -0,0 +1,48 @@
|
|||
# Find bbtag pairs
|
||||
|
||||
A "bb-tag" looks like `[tag]...[/tag]`, where `tag` is one of: `b`, `url` or `quote`.
|
||||
|
||||
For instance:
|
||||
```
|
||||
[b]text[/b]
|
||||
[url]http://google.com[/url]
|
||||
```
|
||||
|
||||
BB-tags can be nested. But a tag can't be nested into itself, for instance:
|
||||
|
||||
```
|
||||
Normal:
|
||||
[url] [b]http://google.com[/b] [/url]
|
||||
[quote] [b]text[/b] [/quote]
|
||||
|
||||
Impossible:
|
||||
[b][b]text[/b][/b]
|
||||
```
|
||||
|
||||
Tags can contain line breaks, that's normal:
|
||||
|
||||
```
|
||||
[quote]
|
||||
[b]text[/b]
|
||||
[/quote]
|
||||
```
|
||||
|
||||
Create a regexp to find all BB-tags with their contents.
|
||||
|
||||
For instance:
|
||||
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
|
||||
let str = "..[url]http://google.com[/url]..";
|
||||
alert( str.match(reg) ); // [url]http://google.com[/url]
|
||||
```
|
||||
|
||||
If tags are nested, then we need the outer tag (if we want we can continue the search in its content):
|
||||
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
|
||||
let str = "..[url][b]http://google.com[/b][/url]..";
|
||||
alert( str.match(reg) ); // [url][b]http://google.com[/b][/url]
|
||||
```
|
|
@ -0,0 +1,17 @@
|
|||
The solution: `pattern:/"(\\.|[^"\\])*"/g`.
|
||||
|
||||
Step by step:
|
||||
|
||||
- First we look for an opening quote `pattern:"`
|
||||
- Then if we have a backslash `pattern:\\` (we technically have to double it in the pattern, because it is a special character, so that's a single backslash in fact), then any character is fine after it (a dot).
|
||||
- Otherwise we take any character except a quote (that would mean the end of the string) and a backslash (to prevent lonely backslashes, the backslash is only used with some other symbol after it): `pattern:[^"\\]`
|
||||
- ...And so on till the closing quote.
|
||||
|
||||
In action:
|
||||
|
||||
```js run
|
||||
let reg = /"(\\.|[^"\\])*"/g;
|
||||
let str = ' .. "test me" .. "Say \\"Hello\\"!" .. "\\\\ \\"" .. ';
|
||||
|
||||
alert( str.match(reg) ); // "test me","Say \"Hello\"!","\\ \""
|
||||
```
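Doubling every backslash in the test string is easy to get wrong. One way to avoid it is the built-in `String.raw` tag, which keeps backslashes exactly as typed. The sketch below builds the same in-memory string that way; the variable names are illustrative.

```js
let reg = /"(\\.|[^"\\])*"/g;

// String.raw: backslashes in the template are taken literally
let str = String.raw` .. "test me" .. "Say \"Hello\"!" .. "\\ \"" .. `;

let found = str.match(reg);

console.log(found.length); // 3
console.log(found[1]);     // "Say \"Hello\"!"
```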
|
|
@ -0,0 +1,32 @@
|
|||
# Find quoted strings
|
||||
|
||||
Create a regexp to find strings in double quotes `subject:"..."`.
|
||||
|
||||
The important part is that strings should support escaping, in the same way as JavaScript strings do. For instance, quotes can be inserted as `subject:\"`, a newline as `subject:\n`, and the backslash itself as `subject:\\`.
|
||||
|
||||
```js
|
||||
let str = "Just like \"here\".";
|
||||
```
|
||||
|
||||
For us it's important that an escaped quote `subject:\"` does not end a string.
|
||||
|
||||
So we should look from one quote to the other ignoring escaped quotes on the way.
|
||||
|
||||
That's the essential part of the task, otherwise it would be trivial.
|
||||
|
||||
Examples of strings to match:
|
||||
```js
|
||||
.. *!*"test me"*/!* ..
|
||||
.. *!*"Say \"Hello\"!"*/!* ... (escaped quotes inside)
|
||||
.. *!*"\\"*/!* .. (double slash inside)
|
||||
.. *!*"\\ \""*/!* .. (double slash and an escaped quote inside)
|
||||
```
|
||||
|
||||
In JavaScript we need to double the backslashes to pass them right into the string, like this:
|
||||
|
||||
```js run
|
||||
let str = ' .. "test me" .. "Say \\"Hello\\"!" .. "\\\\ \\"" .. ';
|
||||
|
||||
// the in-memory string
|
||||
alert(str); // .. "test me" .. "Say \"Hello\"!" .. "\\ \"" ..
|
||||
```
|
|
@ -0,0 +1,16 @@
|
|||
|
||||
The pattern start is obvious: `pattern:<style`.
|
||||
|
||||
...But then we can't simply write `pattern:<style.*?>`, because `match:<styler>` would match it.
|
||||
|
||||
We need either a space after `match:<style` and then optionally something else or the ending `match:>`.
|
||||
|
||||
In the regexp language: `pattern:<style(>|\s.*?>)`.
|
||||
|
||||
In action:
|
||||
|
||||
```js run
|
||||
let reg = /<style(>|\s.*?>)/g;
|
||||
|
||||
alert( '<style> <styler> <style test="...">'.match(reg) ); // <style>, <style test="...">
|
||||
```
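An alternative that gives the same result on this input makes the attribute part optional and uses an exclusion instead of a lazy quantifier. This is a sketch of another possible pattern, not the only correct answer.

```js
// After "<style": either ">" right away, or a space followed by any non-">" characters
let alternative = /<style(\s[^>]*)?>/g;

let styleTags = '<style> <styler> <style test="...">'.match(alternative);

console.log(styleTags); // [ '<style>', '<style test="...">' ]
```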
|
|
@ -0,0 +1,13 @@
|
|||
# Find the full tag
|
||||
|
||||
Write a regexp to find the tag `<style...>`. It should match the full tag: it may have no attributes `<style>` or have several of them `<style type="..." id="...">`.
|
||||
|
||||
...But the regexp should not match `<styler>`!
|
||||
|
||||
For instance:
|
||||
|
||||
```js
|
||||
let reg = /your regexp/g;
|
||||
|
||||
alert( '<style> <styler> <style test="...">'.match(reg) ); // <style>, <style test="...">
|
||||
```
|
59
9-regular-expressions/11-regexp-alternation/article.md
Normal file
|
@ -0,0 +1,59 @@
|
|||
# Alternation (OR) |
|
||||
|
||||
Alternation is the term in regular expressions for what is actually a simple "OR".
|
||||
|
||||
In a regular expression it is denoted with a vertical line character `pattern:|`.
|
||||
|
||||
For instance, we need to find programming languages: HTML, PHP, CSS, Java or JavaScript.
|
||||
|
||||
The corresponding regexp: `pattern:html|php|css|java(script)?`.
|
||||
|
||||
A usage example:
|
||||
|
||||
```js run
|
||||
let reg = /html|php|css|java(script)?/gi;
|
||||
|
||||
let str = "First HTML appeared, then CSS, then JavaScript";
|
||||
|
||||
alert( str.match(reg) ); // 'HTML', 'CSS', 'JavaScript'
|
||||
```
|
||||
|
||||
We already know a similar thing -- square brackets. They allow choosing between multiple characters, for instance `pattern:gr[ae]y` matches `match:gray` or `match:grey`.
|
||||
|
||||
Square brackets allow only characters or character sets. Alternation allows any expressions. A regexp `pattern:A|B|C` means one of the expressions `A`, `B` or `C`.
|
||||
|
||||
For instance:
|
||||
|
||||
- `pattern:gr(a|e)y` means exactly the same as `pattern:gr[ae]y`.
|
||||
- `pattern:gra|ey` means `match:gra` or `match:ey`.
|
||||
|
||||
To separate a part of the pattern for alternation we usually enclose it in parentheses, like this: `pattern:before(XXX|YYY)after`.
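
To see the difference in scope, here is a small sketch that runs both variants on the same made-up string (the sample text and variable names are assumptions for illustration):

```js
const sample = "gray grey gra ey";

// Parentheses limit the alternation to a single letter.
const grouped = sample.match(/gr(a|e)y/g);

// Without parentheses the alternation splits the whole pattern: "gra" OR "ey".
const ungrouped = sample.match(/gra|ey/g);

console.log(grouped);   // [ 'gray', 'grey' ]
console.log(ungrouped); // [ 'gra', 'ey', 'gra', 'ey' ]
```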
|
||||
|
||||
## Regexp for time
|
||||
|
||||
In previous chapters there was a task to build a regexp for searching time in the form `hh:mm`, for instance `12:00`. But a simple `pattern:\d\d:\d\d` is too vague. It accepts `25:99` as the time (99 minutes matches the pattern, but such a time is invalid).
|
||||
|
||||
How can we make a better one?
|
||||
|
||||
We can apply more careful matching. First, the hours:
|
||||
|
||||
- If the first digit is `0` or `1`, then the next digit can be anything.
|
||||
- Or, if the first digit is `2`, then the next must be `pattern:[0-3]`.
|
||||
|
||||
As a regexp: `pattern:[01]\d|2[0-3]`.
|
||||
|
||||
Next, the minutes must be from `0` to `59`. In the regexp language that means `pattern:[0-5]\d`: the first digit `0-5`, and then any digit.
|
||||
|
||||
Let's glue them together into the pattern: `pattern:[01]\d|2[0-3]:[0-5]\d`.
|
||||
|
||||
We're almost done, but there's a problem. The alternation `pattern:|` now happens to be between `pattern:[01]\d` and `pattern:2[0-3]:[0-5]\d`.
|
||||
|
||||
That's wrong, as it should be applied only to hours `[01]\d` OR `2[0-3]`. That's a common mistake when starting to work with regular expressions.
|
||||
|
||||
The correct variant:
|
||||
|
||||
```js run
|
||||
let reg = /([01]\d|2[0-3]):[0-5]\d/g;
|
||||
|
||||
alert("00:00 10:10 23:59 25:99 1:2".match(reg)); // 00:00,10:10,23:59
|
||||
```
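
For comparison, a short sketch that runs the mistaken pattern next to the corrected one on the same input. The variable names are illustrative.

```js
const times = "00:00 10:10 23:59 25:99 1:2";

// Mistake: the alternation splits the pattern into "[01]\d" OR "2[0-3]:[0-5]\d".
const wrong = /[01]\d|2[0-3]:[0-5]\d/g;

// Correct: parentheses limit the alternation to the hours.
const right = /([01]\d|2[0-3]):[0-5]\d/g;

const wrongResult = times.match(wrong);
const rightResult = times.match(right);

console.log(wrongResult); // [ '00', '00', '10', '10', '23:59' ]
console.log(rightResult); // [ '00:00', '10:10', '23:59' ]
```

The mistaken variant happily returns bare two-digit fragments, which is usually the first visible symptom of a misplaced alternation.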
|
|
@ -0,0 +1,6 @@
|
|||
|
||||
The empty string is the only match: it starts and immediately finishes.
|
||||
|
||||
The task once again demonstrates that anchors are not characters, but tests.
|
||||
|
||||
The string is empty `""`. The engine first matches the `pattern:^` (input start), yes it's there, and then immediately the end `pattern:$`, it's here too. So there's a match.
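
A minimal sketch to check this. The last two lines assume the multiline flag `m`, under which an empty line inside a longer text also satisfies `^$`:

```js
const emptyMatches = /^$/.test("");            // true: start and end coincide
const spaceMatches = /^$/.test(" ");           // false: there is a character in between
const emptyLine = /^$/m.test("a\n\nb");        // true: with "m", the empty second line matches
const noMultiline = /^$/.test("a\n\nb");       // false: without "m" only the whole text counts

console.log(emptyMatches, spaceMatches, emptyLine, noMultiline); // true false true false
```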
|
|
@ -0,0 +1,3 @@
|
|||
# Regexp ^$
|
||||
|
||||
Which string matches the pattern `pattern:^$`?
|
|
@ -0,0 +1,21 @@
|
|||
A two-digit hex number is `pattern:[0-9a-f]{2}` (assuming the `pattern:i` flag is enabled).
|
||||
|
||||
We need that number `NN`, and then `:NN` repeated 5 times (more numbers).
|
||||
|
||||
The regexp is: `pattern:[0-9a-f]{2}(:[0-9a-f]{2}){5}`
|
||||
|
||||
Now let's show that the match should capture all the text: start at the beginning and end at the end. That's done by wrapping the pattern in `pattern:^...$`.
|
||||
|
||||
Finally:
|
||||
|
||||
```js run
|
||||
let reg = /^[0-9a-f]{2}(:[0-9a-f]{2}){5}$/i;
|
||||
|
||||
alert( reg.test('01:32:54:67:89:AB') ); // true
|
||||
|
||||
alert( reg.test('0132546789AB') ); // false (no colons)
|
||||
|
||||
alert( reg.test('01:32:54:67:89') ); // false (5 numbers, need 6)
|
||||
|
||||
alert( reg.test('01:32:54:67:89:ZZ') ); // false (ZZ at the end)
|
||||
```
|
20
9-regular-expressions/12-regexp-anchors/2-test-mac/task.md
Normal file
|
@ -0,0 +1,20 @@
|
|||
# Check MAC-address
|
||||
|
||||
The [MAC address](https://en.wikipedia.org/wiki/MAC_address) of a network interface consists of 6 two-digit hex numbers separated by colons.
|
||||
|
||||
For instance: `subject:'01:32:54:67:89:AB'`.
|
||||
|
||||
Write a regexp that checks whether a string is a MAC address.
|
||||
|
||||
Usage:
|
||||
```js
|
||||
let reg = /your regexp/;
|
||||
|
||||
alert( reg.test('01:32:54:67:89:AB') ); // true
|
||||
|
||||
alert( reg.test('0132546789AB') ); // false (no colons)
|
||||
|
||||
alert( reg.test('01:32:54:67:89') ); // false (5 numbers, must be 6)
|
||||
|
||||
alert( reg.test('01:32:54:67:89:ZZ') ); // false (ZZ at the end)
|
||||
```
|
55
9-regular-expressions/12-regexp-anchors/article.md
Normal file
|
@ -0,0 +1,55 @@
|
|||
# String start ^ and finish $
|
||||
|
||||
The caret `pattern:'^'` and dollar `pattern:'$'` characters have special meaning in a regexp. They are called "anchors".
|
||||
|
||||
The caret `pattern:^` matches at the beginning of the text, and the dollar `pattern:$` -- at the end.
|
||||
|
||||
For instance, let's test if the text starts with `Mary`:
|
||||
|
||||
```js run
|
||||
let str1 = "Mary had a little lamb, its fleece was white as snow";
|
||||
let str2 = 'Everywhere Mary went, the lamb was sure to go';
|
||||
|
||||
alert( /^Mary/.test(str1) ); // true
|
||||
alert( /^Mary/.test(str2) ); // false
|
||||
```
|
||||
|
||||
The pattern `pattern:^Mary` means: "the string start and then Mary".
|
||||
|
||||
Now let's test whether the text ends with an email.
|
||||
|
||||
To match an email, we can use a regexp `pattern:[-.\w]+@([\w-]+\.)+[\w-]{2,20}`.
|
||||
|
||||
To test whether the string ends with the email, let's add `pattern:$` to the pattern:
|
||||
|
||||
```js run
|
||||
let reg = /[-.\w]+@([\w-]+\.)+[\w-]{2,20}$/; // no "g" flag: repeated test() calls stay independent
|
||||
|
||||
let str1 = 'My email is mail@site.com';
|
||||
let str2 = 'Everywhere Mary went, the lamb was sure to go';
|
||||
|
||||
alert( reg.test(str1) ); // true
|
||||
alert( reg.test(str2) ); // false
|
||||
```
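
One thing to keep in mind when calling `test` several times on the same regexp object: with the `g` flag the regexp remembers where the previous match ended (in its `lastIndex` property) and starts the next search from there. The sketch below uses a made-up pattern `endsWithDigits` to show the effect:

```js
const endsWithDigits = /\d+$/g;

// The match "123" ends at index 7, so lastIndex becomes 7.
const first = endsWithDigits.test("abc 123");

// The search starts at index 7, which is past the end of "456":
// no match, and lastIndex is reset to 0.
const second = endsWithDigits.test("456");

// Now the search starts from 0 again and succeeds.
const third = endsWithDigits.test("456");

console.log(first, second, third); // true false true
```

For simple yes/no checks like the ones here, leaving out the `g` flag avoids this kind of hidden state.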
|
||||
|
||||
We can use both anchors together to check whether the string exactly follows the pattern. That's often used for validation.
|
||||
|
||||
For instance we want to check that `str` is exactly a color in the form `#` plus 6 hex digits. The pattern for the color is `pattern:#[0-9a-f]{6}`.
|
||||
|
||||
To check that the *whole string* exactly matches it, we add `pattern:^...$`:
|
||||
|
||||
```js run
|
||||
let str = "#abcdef";
|
||||
|
||||
alert( /^#[0-9a-f]{6}$/i.test(str) ); // true
|
||||
```
|
||||
|
||||
The regexp engine looks for the text start, then the color, and then immediately the text end. Just what we need.
|
||||
|
||||
```smart header="Anchors have zero length"
|
||||
Anchors, just like `\b`, are tests. They have zero width.
|
||||
|
||||
In other words, they do not match a character, but rather force the regexp engine to check the condition (text start/end).
|
||||
```
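
A small sketch that makes the zero width visible: a match for an anchor is an empty string, and replacing an anchor inserts text without removing anything. The names are illustrative.

```js
const text = "abc";

const startMatch = text.match(/^/);
console.log(startMatch[0].length); // 0 (the match is an empty string)
console.log(startMatch.index);     // 0 (found at the very beginning)

// Replacing an anchor only inserts text, nothing is consumed.
const withPrefix = text.replace(/^/, ">");
const withSuffix = text.replace(/$/, "<");

console.log(withPrefix); // >abc
console.log(withSuffix); // abc<
```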
|
||||
|
||||
The behavior of anchors changes if there's a flag `pattern:m` (multiline mode). We'll explore it in the next chapter.
|
76
9-regular-expressions/13-regexp-multiline-mode/article.md
Normal file
|
@ -0,0 +1,76 @@
|
|||
# Multiline mode, flag "m"
|
||||
|
||||
The multiline mode is enabled by the flag `pattern:/.../m`.
|
||||
|
||||
It only affects the behavior of `pattern:^` and `pattern:$`.
|
||||
|
||||
In the multiline mode they match not only at the beginning and end of the string, but also at the start/end of each line.
|
||||
|
||||
## Line start ^
|
||||
|
||||
In the example below the text has multiple lines. The pattern `pattern:/^\d+/gm` takes a number from the beginning of each one:
|
||||
|
||||
```js run
|
||||
let str = `1st place: Winnie
|
||||
2nd place: Piglet
|
||||
33rd place: Eeyore`;
|
||||
|
||||
*!*
|
||||
alert( str.match(/^\d+/gm) ); // 1, 2, 33
|
||||
*/!*
|
||||
```
|
||||
|
||||
Without the flag `pattern:/.../m` only the first number is matched:
|
||||
|
||||
|
||||
```js run
|
||||
let str = `1st place: Winnie
|
||||
2nd place: Piglet
|
||||
33rd place: Eeyore`;
|
||||
|
||||
*!*
|
||||
alert( str.match(/^\d+/g) ); // 1
|
||||
*/!*
|
||||
```
|
||||
|
||||
That's because by default a caret `pattern:^` only matches at the beginning of the text, and in the multiline mode -- at the start of any line.
|
||||
|
||||
The regular expression engine moves along the text and looks for a line start `pattern:^`; when it finds one, it continues to match the rest of the pattern `pattern:\d+`.
|
||||
|
||||
## Line end $
|
||||
|
||||
The dollar sign `pattern:$` behaves similarly.
|
||||
|
||||
The regular expression `pattern:\w+$` finds the last word in every line:
|
||||
|
||||
```js run
|
||||
let str = `1st place: Winnie
|
||||
2nd place: Piglet
|
||||
33rd place: Eeyore`;
|
||||
|
||||
alert( str.match(/\w+$/gim) ); // Winnie,Piglet,Eeyore
|
||||
```
|
||||
|
||||
Without the `pattern:/.../m` flag the dollar `pattern:$` would only match the end of the whole string, so only the very last word would be found.
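
A side-by-side sketch of both cases, using the same three-line text (the variable names are illustrative):

```js
const places = `1st place: Winnie
2nd place: Piglet
33rd place: Eeyore`;

// Without "m": $ matches only at the end of the whole string.
const lastOnly = places.match(/\w+$/g);

// With "m": $ matches at the end of every line.
const everyLine = places.match(/\w+$/gm);

console.log(lastOnly);  // [ 'Eeyore' ]
console.log(everyLine); // [ 'Winnie', 'Piglet', 'Eeyore' ]
```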
|
||||
|
||||
## Anchors ^$ versus \n
|
||||
|
||||
To find a newline, we can use not only `pattern:^` and `pattern:$`, but also the newline character `\n`.
|
||||
|
||||
The first difference is that unlike anchors, the character `\n` "consumes" the newline character and adds it to the result.
|
||||
|
||||
For instance, here we use it instead of `pattern:$`:
|
||||
|
||||
```js run
|
||||
let str = `1st place: Winnie
|
||||
2nd place: Piglet
|
||||
33rd place: Eeyore`;
|
||||
|
||||
alert( str.match(/\w+\n/gim) ); // Winnie\n,Piglet\n
|
||||
```
|
||||
|
||||
Here every match is a word plus a newline character.
|
||||
|
||||
And one more difference -- the newline `\n` does not match at the string end. That's why `Eeyore` is not found in the example above.
|
||||
|
||||
So, anchors are usually better, they are closer to what we want to get.
|
105
9-regular-expressions/14-regexp-lookahead-lookbehind/article.md
Normal file
|
@ -0,0 +1,105 @@
|
|||
# Lookahead and lookbehind
|
||||
|
||||
Sometimes we need to match a pattern only if followed by another pattern. For instance, we'd like to get the price from a string like `subject:1 turkey costs 30€`.
|
||||
|
||||
We need a number (let's say a price has no decimal point) followed by the `subject:€` sign.
|
||||
|
||||
That's what lookahead is for.
|
||||
|
||||
## Lookahead
|
||||
|
||||
The syntax is: `pattern:x(?=y)`, it means "look for `pattern:x`, but match only if followed by `pattern:y`".
|
||||
|
||||
For an integer amount followed by `subject:€`, the regexp will be `pattern:\d+(?=€)`:
|
||||
|
||||
```js run
|
||||
let str = "1 turkey costs 30€";
|
||||
|
||||
alert( str.match(/\d+(?=€)/) ); // 30 (correctly skipped the sole number 1)
|
||||
```
|
||||
|
||||
Let's say we want a quantity instead, that is a number, NOT followed by `subject:€`.
|
||||
|
||||
Here a negative lookahead can be applied.
|
||||
|
||||
The syntax is: `pattern:x(?!y)`, it means "search `pattern:x`, but only if not followed by `pattern:y`".
|
||||
|
||||
```js run
|
||||
let str = "2 turkeys cost 60€";
|
||||
|
||||
alert( str.match(/\d+(?!€)/) ); // 2 (correctly skipped the price)
|
||||
```
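
A caveat worth knowing, shown as a sketch: if we search for *all* such numbers with the `g` flag, the greedy `pattern:\d+` can backtrack and give up the last digit, so a piece of the price slips through. Forbidding a digit in the lookahead as well is one possible fix (the pattern `strict` below is just one way to do it):

```js
const order = "2 turkeys cost 60€";

// "60" is followed by €, so \d+ backtracks to "6", which is followed by "0".
const naive = order.match(/\d+(?!€)/g);

// Also forbid a digit after the number, so partial numbers are rejected.
const strict = order.match(/\d+(?![\d€])/g);

console.log(naive);  // [ '2', '6' ]
console.log(strict); // [ '2' ]
```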
|
||||
|
||||
## Lookbehind
|
||||
|
||||
Lookahead allows us to add a condition for "what goes after".
|
||||
|
||||
Lookbehind is similar, but it looks behind. That is, it allows us to match a pattern only if there's something before it.
|
||||
|
||||
The syntax is:
|
||||
- Positive lookbehind: `pattern:(?<=y)x`, matches `pattern:x`, but only if it follows after `pattern:y`.
|
||||
- Negative lookbehind: `pattern:(?<!y)x`, matches `pattern:x`, but only if there's no `pattern:y` before.
|
||||
|
||||
For example, let's change the price to US dollars. The dollar sign is usually before the number, so to look for `$30` we'll use `pattern:(?<=\$)\d+` -- an amount preceded by `subject:$`:
|
||||
|
||||
```js run
|
||||
let str = "1 turkey costs $30";
|
||||
|
||||
alert( str.match(/(?<=\$)\d+/) ); // 30 (skipped the sole number)
|
||||
```
|
||||
|
||||
And, to find the quantity -- a number not preceded by `subject:$` -- we can use a negative lookbehind `pattern:(?<!\$)\d+`:
|
||||
|
||||
```js run
|
||||
let str = "2 turkeys cost $60";
|
||||
|
||||
alert( str.match(/(?<!\$)\d+/) ); // 2 (skipped the price)
|
||||
```
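
As with negative lookahead, a global search reveals a subtlety: the `0` of the price is preceded by `6`, not by `$`, so it passes the check. A hedged sketch of the effect and of one possible fix (also forbidding a digit before the number):

```js
const bill = "2 turkeys cost $60";

// "6" is rejected (preceded by $), but "0" is preceded by "6" and gets through.
const naiveQty = bill.match(/(?<!\$)\d+/g);

// Reject numbers preceded by either $ or another digit.
const strictQty = bill.match(/(?<![$\d])\d+/g);

console.log(naiveQty);  // [ '2', '0' ]
console.log(strictQty); // [ '2' ]
```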
|
||||
|
||||
## Capture groups
|
||||
|
||||
Generally, what's inside the lookaround (a common name for both lookahead and lookbehind) parentheses does not become a part of the match.
|
||||
|
||||
E.g. in the pattern `pattern:\d+(?!€)`, the `pattern:€` sign doesn't get captured as a part of the match.
|
||||
|
||||
But if we want to capture the whole lookaround expression or a part of it, that's possible. We just need to wrap that part into additional parentheses.
|
||||
|
||||
For instance, here the currency `pattern:(€|kr)` is captured, along with the amount:
|
||||
|
||||
```js run
|
||||
let str = "1 turkey costs 30€";
|
||||
let reg = /\d+(?=(€|kr))/; // extra parentheses around €|kr
|
||||
|
||||
alert( str.match(reg) ); // 30, €
|
||||
```
|
||||
|
||||
And here's the same for lookbehind:
|
||||
|
||||
```js run
|
||||
let str = "1 turkey costs $30";
|
||||
let reg = /(?<=(\$|£))\d+/;
|
||||
|
||||
alert( str.match(reg) ); // 30, $
|
||||
```
|
||||
|
||||
Please note that for lookbehind the order of items in the result stays the same, even though the lookbehind parentheses come before the main pattern.
|
||||
|
||||
Parentheses are numbered left-to-right as usual, but the result array always starts with the full match, and the groups follow it. The lookbehind itself is not a part of the full match, so the match for `pattern:\d+` goes in the result first, and then the group `pattern:(\$|£)`.
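
To inspect the result item by item, here is a short sketch (the variable name `m` is arbitrary):

```js
const m = "1 turkey costs $30".match(/(?<=(\$|£))\d+/);

console.log(m[0]);    // 30 (the full match, the lookbehind is not included)
console.log(m[1]);    // $  (the group inside the lookbehind)
console.log(m.index); // 16 (position of "30" in the string)
```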
|
||||
|
||||
|
||||
## Summary
|
||||
|
||||
Lookahead and lookbehind (commonly referred to as "lookaround") are useful when we'd like to match something depending on the context before/after it, without taking that context into the match.
|
||||
|
||||
Sometimes we can do the same manually, that is: match all and filter by context in the loop. Remember, `str.matchAll` and `reg.exec` return matches with `.index` property, so we know where exactly in the text it is. But generally regular expressions can do it better.
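
A sketch of that manual approach, assuming we want prices, that is numbers directly followed by `€`. The names `text` and `prices` are made up for the example:

```js
const text = "2 turkeys cost 60€";
const prices = [];

// Match every number, then check the character right after it by hand.
for (const m of text.matchAll(/\d+/g)) {
  const nextChar = text[m.index + m[0].length];
  if (nextChar === "€") prices.push(m[0]);
}

console.log(prices); // [ '60' ]
```

The lookahead `pattern:\d+(?=€)` expresses the same filter in a single pattern.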
|
||||
|
||||
Lookaround types:
|
||||
|
||||
| Pattern | Type | Matches |
|--------------------|------------------|---------|
| `pattern:x(?=y)` | Positive lookahead | `x` if followed by `y` |
| `pattern:x(?!y)` | Negative lookahead | `x` if not followed by `y` |
| `pattern:(?<=y)x` | Positive lookbehind | `x` if after `y` |
| `pattern:(?<!y)x` | Negative lookbehind | `x` if not after `y` |
|
||||
|
||||
Lookahead can also be used to disable backtracking. Why that may be needed -- see in the next chapter.
|
|
@ -0,0 +1,293 @@
|
|||
# Infinite backtracking problem
|
||||
|
||||
Some regular expressions look simple, but can take a veeeeeery long time to execute, and even "hang" the JavaScript engine.
|
||||
|
||||
Sooner or later most developers face such behavior.
|
||||
|
||||
The typical situation -- a regular expression works fine sometimes, but for certain strings it "hangs" consuming 100% of CPU.
|
||||
|
||||
In a web-browser it kills the page. Not a good thing for sure.
|
||||
|
||||
For server-side JavaScript it may become a vulnerability if it uses regular expressions to process user data. Bad input will make the process hang, causing denial of service. The author personally saw and reported such vulnerabilities even for very well-known and widely used programs.
|
||||
|
||||
So the problem is definitely worth dealing with.
|
||||
|
||||
## Introduction
|
||||
|
||||
The plan will be like this:
|
||||
|
||||
1. First we see how the problem may occur.
|
||||
2. Then we simplify the situation and see why it occurs.
|
||||
3. Then we fix it.
|
||||
|
||||
For instance let's consider searching tags in HTML.
|
||||
|
||||
We want to find all tags, with or without attributes -- like `subject:<a href="..." class="doc" ...>`. We need the regexp to work reliably, because HTML comes from the internet and can be messy.
|
||||
|
||||
In particular, we need it to match tags like `<a test="<>" href="#">` -- with `<` and `>` in attributes. That's allowed by [HTML standard](https://html.spec.whatwg.org/multipage/syntax.html#syntax-attributes).
|
||||
|
||||
Now we can see that a simple regexp like `pattern:<[^>]+>` doesn't work, because it stops at the first `>`, and we need to ignore `<>` if inside an attribute.
|
||||
|
||||
```js run
|
||||
// the match doesn't reach the end of the tag - wrong!
|
||||
alert( '<a test="<>" href="#">'.match(/<[^>]+>/) ); // <a test="<>
|
||||
```
|
||||
|
||||
To correctly handle such situations we need a more complex regular expression. It will have the form `pattern:<tag (key=value)*>`.
|
||||
|
||||
1. For the `tag` name: `pattern:\w+`,
|
||||
2. For the `key` name: `pattern:\w+`,
|
||||
3. And the `value`: a quoted string `pattern:"[^"]*"`.
|
||||
|
||||
If we substitute these into the pattern above and throw in some optional spaces `pattern:\s`, the full regexp becomes: `pattern:<\w+(\s*\w+="[^"]*"\s*)*>`.
|
||||
|
||||
That regexp is not perfect! It doesn't yet support all details of HTML, for instance unquoted values, and there are other ways to improve, but let's not add complexity. It will demonstrate the problem for us.
|
||||
|
||||
The regexp seems to work:
|
||||
|
||||
```js run
|
||||
let reg = /<\w+(\s*\w+="[^"]*"\s*)*>/g;
|
||||
|
||||
let str='...<a test="<>" href="#">... <b>...';
|
||||
|
||||
alert( str.match(reg) ); // <a test="<>" href="#">, <b>
|
||||
```
|
||||
|
||||
Great! It found both the long tag `match:<a test="<>" href="#">` and the short one `match:<b>`.
|
||||
|
||||
Now that we've got a seemingly working solution, let's get to the infinite backtracking itself.
|
||||
|
||||
## Infinite backtracking
|
||||
|
||||
If you run our regexp on the input below, it may hang the browser (or another JavaScript host):
|
||||
|
||||
```js run
|
||||
let reg = /<\w+(\s*\w+="[^"]*"\s*)*>/g;
|
||||
|
||||
let str = `<tag a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b"
|
||||
a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b"`;
|
||||
|
||||
*!*
|
||||
// The search will take a long, long time
|
||||
alert( str.match(reg) );
|
||||
*/!*
|
||||
```
|
||||
|
||||
Some regexp engines can handle that search, but most of them can't.
|
||||
|
||||
What's the matter? Why does a simple regular expression "hang" on such a small string?
|
||||
|
||||
Let's simplify the regexp by stripping the tag name and the quotes. So that we look only for `key=value` attributes: `pattern:<(\s*\w+=\w+\s*)*>`.
|
||||
|
||||
Unfortunately, the regexp still hangs:
|
||||
|
||||
```js run
|
||||
// only search for space-delimited attributes
|
||||
let reg = /<(\s*\w+=\w+\s*)*>/g;
|
||||
|
||||
let str = `<a=b a=b a=b a=b a=b a=b a=b a=b
|
||||
a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b`;
|
||||
|
||||
*!*
|
||||
// the search will take a long, long time
|
||||
alert( str.match(reg) );
|
||||
*/!*
|
||||
```
|
||||
|
||||
Here we end the demo of the problem and start looking into what's going on, why it hangs and how to fix it.
|
||||
|
||||
## Detailed example
|
||||
|
||||
To make an example even simpler, let's consider `pattern:(\d+)*$`.
|
||||
|
||||
This regular expression also has the same problem. In most regexp engines that search takes a very long time (careful -- can hang):
|
||||
|
||||
```js run
|
||||
alert( '12345678901234567890123456789123456789z'.match(/(\d+)*$/) );
|
||||
```
|
||||
|
||||
So what's wrong with the regexp?
|
||||
|
||||
First, one may notice that the regexp is a little bit strange. The quantifier `pattern:*` looks extraneous. If we want a number, we can use `pattern:\d+$`.
|
||||
|
||||
Indeed, the regexp is artificial. But the reason why it is slow is the same as those we saw above. So let's understand it, and then the previous example will become obvious.
|
||||
|
||||
What happens during the search of `pattern:(\d+)*$` in the line `subject:123456789z`?
|
||||
|
||||
1. First, the regexp engine tries to find a number `pattern:\d+`. The plus `pattern:+` is greedy by default, so it consumes all digits:
|
||||
|
||||
```
|
||||
\d+.......
|
||||
(123456789)z
|
||||
```
|
||||
2. Then it tries to apply the star quantifier, but there are no more digits, so the star doesn't give anything.
|
||||
|
||||
3. Then the pattern expects to see the string end `pattern:$`, and in the text we have `subject:z`, so there's no match:
|
||||
|
||||
```
|
||||
X
|
||||
\d+........$
|
||||
(123456789)z
|
||||
```
|
||||
|
||||
4. As there's no match, the greedy quantifier `pattern:+` decreases the count of repetitions (backtracks).
|
||||
|
||||
Now `\d+` doesn't take all digits, but all except the last one:
|
||||
```
|
||||
\d+.......
|
||||
(12345678)9z
|
||||
```
|
||||
5. Now the engine tries to continue the search from the new position (`9`).
|
||||
|
||||
The star `pattern:(\d+)*` can be applied -- it gives the number `match:9`:
|
||||
|
||||
```
|
||||
|
||||
\d+.......\d+
|
||||
(12345678)(9)z
|
||||
```
|
||||
|
||||
The engine tries to match `$` again, but fails, because meets `subject:z`:
|
||||
|
||||
```
|
||||
X
|
||||
\d+.......\d+
|
||||
(12345678)(9)z
|
||||
```
|
||||
|
||||
|
||||
5. There's no match, so the engine will continue backtracking, decreasing the number of repetitions for `pattern:\d+` down to 7 digits. So the rest of the string `subject:89` becomes the second `pattern:\d+`:
|
||||
|
||||
```
|
||||
X
|
||||
\d+......\d+
|
||||
(1234567)(89)z
|
||||
```
|
||||
|
||||
...Still no match for `pattern:$`.
|
||||
|
||||
The search engine backtracks again. Backtracking generally works like this: the last greedy quantifier decreases the number of repetitions for as long as it can. Then the previous greedy quantifier decreases, and so on. In our case the last greedy quantifier is the second `pattern:\d+`, from `subject:89` to `subject:8`, and then the star takes `subject:9`:
|
||||
|
||||
```
|
||||
X
|
||||
\d+......\d+\d+
|
||||
(1234567)(8)(9)z
|
||||
```
|
||||
6. ...Fail again. The second and third `pattern:\d+` backtracked to the end, so the first quantifier shortens the match to `subject:123456`, and the star takes the rest:
|
||||
|
||||
```
|
||||
X
|
||||
\d+.......\d+
|
||||
(123456)(789)z
|
||||
```
|
||||
|
||||
Again no match. The process repeats: the last greedy quantifier releases one character (`9`):
|
||||
|
||||
```
|
||||
X
|
||||
\d+.....\d+ \d+
|
||||
(123456)(78)(9)z
|
||||
```
|
||||
7. ...And so on.
|
||||
|
||||
The regular expression engine goes through all the ways to split `123456789` (and its shorter prefixes) into consecutive groups of digits. The number of such splits grows exponentially with the length of the number, that's why it takes so long.
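
To get a feeling for the numbers, here is a rough sketch that counts the ways to cut a run of `n` digits into consecutive non-empty groups. It is only an illustration of the growth, not a model of what the engine does step by step, and the function name `countSplits` is made up:

```js
// Counts the ways to split n digits into consecutive non-empty groups.
function countSplits(n) {
  if (n === 0) return 1; // nothing left to split: exactly one way
  let total = 0;
  // Choose the length of the first group, then split the rest.
  for (let first = 1; first <= n; first++) {
    total += countSplits(n - first);
  }
  return total;
}

console.log(countSplits(3));  // 4: (123) (12)(3) (1)(23) (1)(2)(3)
console.log(countSplits(9));  // 256
console.log(countSplits(20)); // 524288
```

Every extra digit doubles the count, so a string of a few dozen digits already gives more combinations than the engine can check in reasonable time.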
|
||||
|
||||
What to do?
|
||||
|
||||
Should we turn on the lazy mode?
|
||||
|
||||
Unfortunately, that doesn't help: if we replace `pattern:\d+` with `pattern:\d+?`, that still hangs:
|
||||
|
||||
```js run
|
||||
// sloooooowwwwww
|
||||
alert( '12345678901234567890123456789123456789z'.match(/(\d+?)*$/) );
|
||||
```
|
||||
|
||||
Lazy quantifiers actually do the same, but in the reverse order.
|
||||
|
||||
Just think about how the search engine would work in this case.
|
||||
|
||||
Some regular expression engines have tricky built-in checks to detect infinite backtracking or other means to work around them, but there's no universal solution.
|
||||
|
||||
## Back to tags
|
||||
|
||||
In the example above, when we search `pattern:<(\s*\w+=\w+\s*)*>` in the string `subject:<a=b a=b a=b a=b` -- a similar thing happens.
|
||||
|
||||
The string has no `>` at the end, so the match is impossible, but the regexp engine doesn't know about it. The search backtracks trying different combinations of `pattern:(\s*\w+=\w+\s*)`:
|
||||
|
||||
```
|
||||
(a=b a=b a=b) (a=b)
|
||||
(a=b a=b) (a=b a=b)
|
||||
(a=b) (a=b a=b a=b)
|
||||
...
|
||||
```
|
||||
|
||||
## How to fix?
|
||||
|
||||
The backtracking checks many variants that are an obvious fail for a human.
|
||||
|
||||
For instance, in the pattern `pattern:(\d+)*$` a human can easily see that `pattern:(\d+)*` does not need to backtrack `pattern:+`. There's no difference between one or two `\d+`:
|
||||
|
||||
```
|
||||
\d+........
|
||||
(123456789)z
|
||||
|
||||
\d+...\d+....
|
||||
(1234)(56789)z
|
||||
```
|
||||
|
||||
Let's get back to a more real-life example: `pattern:<(\s*\w+=\w+\s*)*>`. We want it to find pairs `name=value` (as many as it can).
|
||||
|
||||
What we would like to do is to forbid backtracking.
|
||||
|
||||
There's totally no need to decrease the number of repetitions.
|
||||
|
||||
In other words, if it found three `name=value` pairs and then can't find `>` after them, then there's no need to decrease the count of repetitions. There's definitely no `>` after the first two pairs, because the third `name=value` pair that we backtracked from is there:
|
||||
|
||||
```
|
||||
(name=value) name=value
|
||||
```
|
||||
|
||||
Modern regexp engines support so-called "possessive" quantifiers for that. They are like greedy, but don't backtrack at all. Pretty simple, they capture whatever they can, and the search continues. There's also another tool called "atomic groups" that forbid backtracking inside parentheses.
|
||||
|
||||
Unfortunately, neither of these features is supported by JavaScript.
|
||||
|
||||
### Lookahead to the rescue
|
||||
|
||||
We can forbid backtracking using lookahead.
|
||||
|
||||
The pattern to take as many repetitions as possible without backtracking is: `pattern:(?=(a+))\1`.
|
||||
|
||||
In other words:
|
||||
- The lookahead `pattern:?=` looks for the maximal count `pattern:a+` from the current position.
|
||||
- And then they are "consumed into the result" by the backreference `pattern:\1` (`pattern:\1` corresponds to the first capturing group, that is the inner parentheses `pattern:(a+)`).
|
||||
|
||||
There will be no backtracking, because lookahead does not backtrack. If it found, say, 5 repetitions of `pattern:a` and the further match failed, then it doesn't go back to 4.
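
The difference can be observed on a tiny example. In this sketch both patterns want a run of `a` followed by one more `a`; the greedy version succeeds by giving one `a` back, while the lookahead version never gives anything back:

```js
// Greedy a+ first takes "aaa", then backtracks to "aa" so the final "a" can match.
const greedy = "aaa".match(/a+a/);

// (?=(a+))\1 takes "aaa" and never releases it, so the final "a" has nothing to match.
const noBacktrack = "aaa".match(/(?=(a+))\1a/);

console.log(greedy[0]);   // aaa
console.log(noBacktrack); // null
```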
|
||||
|
||||
```smart
|
||||
There's more about the relation between possessive quantifiers and lookahead in articles [Regex: Emulate Atomic Grouping (and Possessive Quantifiers) with LookAhead](http://instanceof.me/post/52245507631/regex-emulate-atomic-grouping-with-lookahead) and [Mimicking Atomic Groups](http://blog.stevenlevithan.com/archives/mimic-atomic-groups).
|
||||
```
|
||||
|
||||
So this trick makes the problem disappear.
|
||||
|
||||
Let's fix the regexp for a tag with attributes from the beginning of the chapter, extended here to also allow unquoted values: `pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`. We'll use lookahead to prevent backtracking of `name=value` pairs:
|
||||
|
||||
```js run
|
||||
// regexp to search name=value
|
||||
let attrReg = /(\s*\w+=(\w+|"[^"]*")\s*)/;
|
||||
|
||||
// use new RegExp to nicely insert its source into (?=(a+))\1
|
||||
let fixedReg = new RegExp(`<\\w+(?=(${attrReg.source}*))\\1>`, 'g');
|
||||
|
||||
let goodInput = '...<a test="<>" href="#">... <b>...';
|
||||
|
||||
let badInput = `<tag a=b a=b a=b a=b a=b a=b a=b a=b
|
||||
a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b`;
|
||||
|
||||
alert( goodInput.match(fixedReg) ); // <a test="<>" href="#">, <b>
|
||||
alert( badInput.match(fixedReg) ); // null (no results, fast!)
|
||||
```
|
||||
|
||||
Great, it works! We found both a long tag `match:<a test="<>" href="#">` and a small one `match:<b>`, and (!) didn't hang the engine on the bad input.
|
||||
|
||||
Please note the `attrReg.source` property. `RegExp` objects provide access to their source string in it. That's convenient when we want to insert one regexp into another.
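
As a smaller illustration of the same technique, the sketch below glues two sub-patterns into one regexp for a time in the form `hh:mm`. The names `hours`, `minutes` and `time` are made up for the example. Note that `source` contains only the pattern, flags are not included:

```js
const hours = /[01]\d|2[0-3]/;
const minutes = /[0-5]\d/;

// (?:...) keeps the alternation inside the hours part.
const time = new RegExp(`^(?:${hours.source}):${minutes.source}$`);

console.log(time.source);        // ^(?:[01]\d|2[0-3]):[0-5]\d$
console.log(time.test("23:59")); // true
console.log(time.test("25:99")); // false
```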
|
89
9-regular-expressions/20-regexp-unicode/article.md
Normal file
|
@ -0,0 +1,89 @@
|
|||
|
||||
# Unicode: flag "u"
|
||||
|
||||
The unicode flag `/.../u` enables the correct support of surrogate pairs.
|
||||
|
||||
Surrogate pairs are explained in the chapter <info:string>.
|
||||
|
||||
Let's briefly recall them here. In short, normally characters are encoded with 2 bytes. That gives us 65536 characters maximum. But there are more characters in the world.
|
||||
|
||||
So certain rare characters are encoded with 4 bytes, like `𝒳` (mathematical X) or `😄` (a smile).
|
||||
|
||||
Here are the unicode values to compare:
|
||||
|
||||
| Character | Unicode | Bytes |
|------------|---------|--------|
| `a` | 0x0061 | 2 |
| `≈` | 0x2248 | 2 |
| `𝒳` | 0x1d4b3 | 4 |
| `𝒴` | 0x1d4b4 | 4 |
| `😄` | 0x1f604 | 4 |
|
||||
|
||||
So characters like `a` and `≈` occupy 2 bytes, and those rare ones take 4.
|
||||
|
||||
Unicode is designed in such a way that the 4-byte characters only have a meaning as a whole.
|
||||
|
||||
In the past JavaScript did not know about that, and many string methods still have problems. For instance, `length` thinks that here are two characters:
|
||||
|
||||
```js run
|
||||
alert('😄'.length); // 2
|
||||
alert('𝒳'.length); // 2
|
||||
```
|
||||
|
||||
...But we can see that there's only one, right? The point is that `length` treats 4 bytes as two 2-byte characters. That's incorrect, because they must be considered only together (so-called "surrogate pair").
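
When the number of actual characters is needed, string iteration can help, because it walks over whole code points rather than 2-byte units. A minimal sketch:

```js
const smile = "😄";

console.log(smile.length);      // 2 (two 2-byte units)
console.log([...smile].length); // 1 (spread iterates by code points)

// codePointAt reads the whole surrogate pair as one number.
const code = smile.codePointAt(0).toString(16);
console.log(code); // 1f604
```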
|
||||
|
||||
Normally, regular expressions also treat "long characters" as two 2-byte ones.
|
||||
|
||||
That leads to odd results, for instance let's try to find `pattern:[𝒳𝒴]` in the string `subject:𝒳`:
|
||||
|
||||
```js run
|
||||
alert( '𝒳'.match(/[𝒳𝒴]/) ); // odd result (wrong match actually, "half-character")
|
||||
```
|
||||
|
||||
The result is wrong, because by default the regexp engine does not understand surrogate pairs.
|
||||
|
||||
So, it thinks that `[𝒳𝒴]` are not two, but four characters:
|
||||
1. the left half of `𝒳` `(1)`,
|
||||
2. the right half of `𝒳` `(2)`,
|
||||
3. the left half of `𝒴` `(3)`,
|
||||
4. the right half of `𝒴` `(4)`.
|
||||
|
||||
We can list them like this:
|
||||
|
||||
```js run
|
||||
for(let i=0; i<'𝒳𝒴'.length; i++) {
|
||||
alert('𝒳𝒴'.charCodeAt(i)); // 55349, 56499, 55349, 56500
|
||||
};
|
||||
```
|
||||
|
||||
So it finds only the "left half" of `𝒳`.
|
||||
|
||||
In other words, the search works like `'12'.match(/[1234]/)`: only `1` is returned.
|
||||
|
||||
## The "u" flag
|
||||
|
||||
The `/.../u` flag fixes that.
|
||||
|
||||
It enables surrogate pairs in the regexp engine, so the result is correct:
|
||||
|
||||
```js run
|
||||
alert( '𝒳'.match(/[𝒳𝒴]/u) ); // 𝒳
|
||||
```
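
The dot `pattern:.` is affected in the same way. In this sketch, without `u` the dot takes only half of the surrogate pair, and with `u` it takes the whole character:

```js
const face = "😄";

const wholeWithoutU = /^.$/.test(face); // false: the string looks like two "characters"
const wholeWithU = /^.$/u.test(face);   // true: one character

const halfLength = face.match(/./)[0].length;  // 1 (half of the pair)
const fullLength = face.match(/./u)[0].length; // 2 (the whole pair)

console.log(wholeWithoutU, wholeWithU, halfLength, fullLength); // false true 1 2
```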
|
||||
|
||||
Let's see one more example.
|
||||
|
||||
If we forget the `u` flag and accidentally use surrogate pairs, then we can get an error:
|
||||
|
||||
```js run
|
||||
'𝒳'.match(/[𝒳-𝒴]/); // SyntaxError: invalid range in character class
|
||||
```
|
||||
|
||||
Normally, regexps understand `[a-z]` as a "range of characters with codes between the codes of `a` and `z`".
|
||||
|
||||
But without the `u` flag, surrogate pairs are assumed to be a "pair of independent characters", so `[𝒳-𝒴]` is like `[<55349><56499>-<55349><56500>]` (each surrogate pair is replaced with its two codes). Now we can clearly see that the range `56499-55349` is unacceptable, as the left range border must be less than the right one.
|
||||
|
||||
Using the `u` flag makes it work right:
|
||||
|
||||
```js run
|
||||
alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴
|
||||
```
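
Because a regexp literal with an invalid range fails when the script is parsed, the error is easier to observe with `new RegExp` inside `try..catch`. A small sketch, where the helper `tryCompile` is a made-up name:

```js
// Returns "ok" if the pattern compiles, otherwise the name of the error.
function tryCompile(source, flags) {
  try {
    new RegExp(source, flags);
    return "ok";
  } catch (err) {
    return err.name;
  }
}

const withoutU = tryCompile("[𝒳-𝒴]", "");
const withU = tryCompile("[𝒳-𝒴]", "u");

console.log(withoutU); // SyntaxError
console.log(withU);    // ok
```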
|
|
@ -0,0 +1,86 @@
|
|||
|
||||
# Unicode character properties \p
|
||||
|
||||
[Unicode](https://en.wikipedia.org/wiki/Unicode), the encoding format used by JavaScript strings, has a lot of properties for different characters (or, technically, code points). They describe which "categories" a character belongs to, and a variety of technical details.
|
||||
|
||||
In regular expressions these can be matched with `\p{…}`. The flag `'u'` is required for that.
|
||||
|
||||
For instance, `\p{Letter}` denotes a letter in any language. We can also use `\p{L}`, as `L` is an alias of `Letter`. There are shorter aliases for almost every property.
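
A quick sketch of `\p{L}` in action on a made-up mixed-language string. It assumes an engine that supports Unicode property escapes:

```js
const mixed = "Hello Привет 你好 123_456";

// \p{L}+ matches runs of letters from any script; digits and "_" are not letters.
const words = mixed.match(/\p{L}+/gu);

console.log(words); // [ 'Hello', 'Привет', '你好' ]
```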
|
||||
|
||||
Here's the main tree of properties:
|
||||
|
||||
- Letter `L`:
|
||||
- lowercase `Ll`, modifier `Lm`, titlecase `Lt`, uppercase `Lu`, other `Lo`
|
||||
- Number `N`:
|
||||
- decimal digit `Nd`, letter number `Nl`, other `No`
|
||||
- Punctuation `P`:
|
||||
- connector `Pc`, dash `Pd`, initial quote `Pi`, final quote `Pf`, open `Ps`, close `Pe`, other `Po`
|
||||
- Mark `M` (accents etc):
|
||||
- spacing combining `Mc`, enclosing `Me`, non-spacing `Mn`
|
||||
- Symbol `S`:
|
||||
- currency `Sc`, modifier `Sk`, math `Sm`, other `So`
|
||||
- Separator `Z`:
|
||||
- line `Zl`, paragraph `Zp`, space `Zs`
|
||||
- Other `C`:
|
||||
- control `Cc`, format `Cf`, not assigned `Cn`, private use `Co`, surrogate `Cs`.
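The two-letter subcategories from the tree can be used in the same way as `\p{L}`. A short sketch, with a sample string invented for the illustration:

```js
const text = "Price: €25 or $30";

const currency = text.match(/\p{Sc}/gu);   // currency symbols
const uppercase = text.match(/\p{Lu}/gu);  // uppercase letters
const numbers = text.match(/\p{Nd}+/gu);   // runs of decimal digits

console.log(currency);  // [ '€', '$' ]
console.log(uppercase); // [ 'P' ]
console.log(numbers);   // [ '25', '30' ]
```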
|
||||
|
||||
```smart header="More information"
|
||||
Interested to see which characters belong to a property? There's a tool at <http://cldr.unicode.org/unicode-utilities/list-unicodeset> for that.
|
||||
|
||||
You could also explore properties at [Character Property Index](http://unicode.org/cldr/utility/properties.jsp).
|
||||
|
||||
For the full Unicode Character Database in text format (along with all properties), see <https://www.unicode.org/Public/UCD/latest/ucd/>.
|
||||
```
|
||||
|
||||
There are also other derived categories, like:
|
||||
- `Alphabetic` (`Alpha`), includes Letters `L`, plus letter numbers `Nl` (e.g. the Roman numeral Ⅻ), plus some other symbols `Other_Alphabetic` (`OAlpha`).
|
||||
- `Hex_Digit` includes hexadecimal digits: `0-9`, `a-f`, `A-F`.
|
||||
- ...Unicode is a big beast, it includes a lot of properties.
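To see how a derived category differs from a plain one, we can test the Roman numeral mentioned above. This is only a sketch; the variable names are made up for the example:

```js
const twelve = "\u216B"; // Ⅻ, ROMAN NUMERAL TWELVE, category Nl

const isLetter = /\p{L}/u.test(twelve);         // not a Letter...
const isAlphabetic = /\p{Alpha}/u.test(twelve); // ...but it is Alphabetic

console.log(isLetter);     // false
console.log(isAlphabetic); // true
```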
|
||||
|
||||
For instance, let's look for a 6-digit hex number:
|
||||
|
||||
```js run
|
||||
let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is required
|
||||
|
||||
alert("color: #123ABC".match(reg)); // 123ABC
|
||||
```
|
||||
|
||||
There are also properties with a value. For instance, the Unicode "Script" property (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese) etc, the [list is long](https://en.wikipedia.org/wiki/Script_(Unicode)).
|
||||
|
||||
To search for characters of a certain script, we should supply `Script=<value>` (or the alias `sc=<value>`), e.g. for Cyrillic letters: `\p{sc=Cyrillic}`, for Chinese glyphs: `\p{sc=Han}`, etc:
|
||||
|
||||
```js run
|
||||
let regexp = /\p{sc=Han}+/gu; // get Chinese words
|
||||
|
||||
let str = `Hello Привет 你好 123_456`;
|
||||
|
||||
alert( str.match(regexp) ); // 你好
|
||||
```
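The long form `Script=` and the alias `sc=` are interchangeable. A brief sketch with another made-up sample string:

```js
const str = `Hello Привет Γεια 你好`;

const cyrillic = str.match(/\p{Script=Cyrillic}+/gu); // long form
const greek = str.match(/\p{sc=Greek}+/gu);           // short alias

console.log(cyrillic); // [ 'Привет' ]
console.log(greek);    // [ 'Γεια' ]
```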
|
||||
|
||||
## Building multi-language \w
|
||||
|
||||
Let's make a "universal" regexp for `pattern:\w`, for any language. That task has a standard solution in many programming languages with Unicode-aware regexps, e.g. Perl:
|
||||
|
||||
```js
|
||||
/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u
|
||||
```
|
||||
|
||||
Let's decipher. Remember, `pattern:\w` is actually the same as `pattern:[a-zA-Z0-9_]`.
|
||||
|
||||
So the character set includes:
|
||||
|
||||
- `Alphabetic` for letters,
|
||||
- `Mark` for accents, as in Unicode accents may be represented by separate code points,
|
||||
- `Decimal_Number` for numbers,
|
||||
- `Connector_Punctuation` for the `'_'` character and the like,
|
||||
- `Join_Control`: two special code points with hex codes `200c` and `200d`, used in ligatures, e.g. in Arabic.
|
||||
|
||||
Or, if we replace long names with aliases (a list of aliases [here](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)):
|
||||
|
||||
```js run
|
||||
let regexp = /([\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+)/gu;
|
||||
|
||||
let str = `Hello Привет 你好 123_456`;
|
||||
|
||||
alert( str.match(regexp) ); // Hello,Привет,你好,123_456
|
||||
```
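For comparison, here is what the plain `\w` finds in the same string. This is just a side-by-side sketch, using `console.log` instead of `alert`:

```js
const str = `Hello Привет 你好 123_456`;

// Plain \w only knows [a-zA-Z0-9_]
const asciiWords = str.match(/\w+/g);

// The Unicode-aware character set built from properties
const unicodeWords = str.match(/[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+/gu);

console.log(asciiWords);   // [ 'Hello', '123_456' ]
console.log(unicodeWords); // [ 'Hello', 'Привет', '你好', '123_456' ]
```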
|
71
9-regular-expressions/22-regexp-sticky/article.md
Normal file
|
@ -0,0 +1,71 @@
|
|||
|
||||
# Sticky flag "y", searching at position
|
||||
|
||||
To grasp what the `y` flag is for, and see how great it is, let's explore a practical use case.
|
||||
|
||||
One of the common tasks for regexps is "parsing": we get a text and analyze it for logical components to build a structure.
|
||||
|
||||
For instance, there are HTML parsers for browser pages that turn text into a structured document. There are parsers for programming languages, like JavaScript, etc.
|
||||
|
||||
Writing parsers is a special area, with its own tools and algorithms, so we won't go deep into it, but there's a very common question: "What is the text at the given position?".
|
||||
|
||||
For instance, for a programming language the variants can be like:
|
||||
- Is it a "name" `pattern:\w+`?
|
||||
- Or is it a number `pattern:\d+`?
|
||||
- Or an operator `pattern:[+\-/*]`?
|
||||
- (a syntax error if it's not anything in the expected list)
|
||||
|
||||
In JavaScript, to perform a search starting from a given position, we can use `regexp.exec` with the `regexp.lastIndex` property, but that's not what we need!
|
||||
|
||||
We'd like to check the match exactly at the given position, not "starting" from it.
|
||||
|
||||
Here's a (failing) attempt to use `lastIndex`:
|
||||
|
||||
```js run
|
||||
let str = "(text before) function ...";
|
||||
|
||||
// attempting to find function at position 5:
|
||||
let regexp = /function/g; // must use "g" flag, otherwise lastIndex is ignored
|
||||
regexp.lastIndex = 5;
|
||||
|
||||
alert (regexp.exec(str)); // function
|
||||
```
|
||||
|
||||
The match is found, because `regexp.exec` starts to search from the given position and goes on through the text, successfully matching "function" later.
|
||||
|
||||
We could work around that by checking if the `regexp.exec(str).index` property is `5`, and if not, ignoring the match. But the main problem here is performance.
|
||||
|
||||
The regexp engine does a lot of unnecessary work by scanning at further positions. The delays are clearly noticeable if the text is long, because there are many such searches in a parser.
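The workaround described above could look roughly like this. It's a sketch only: `matchAtWithG` is a hypothetical helper name, not a built-in, and it still pays the cost of scanning the rest of the text:

```js
// Hypothetical helper: emulate "match exactly at pos" with the "g" flag.
function matchAtWithG(regexp, str, pos) {
  regexp.lastIndex = pos;
  const match = regexp.exec(str);
  // exec may find a match further along; accept it only if it starts at pos
  return match !== null && match.index === pos ? match : null;
}

const str = "(text before) function ...";

const atFive = matchAtWithG(/function/g, str, 5);
const atFourteen = matchAtWithG(/function/g, str, 14);

console.log(atFive);                       // null
console.log(atFourteen && atFourteen[0]);  // function
```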
|
||||
|
||||
## The "y" flag
|
||||
|
||||
So we've come to the problem: how to search for a match exactly at the given position.
|
||||
|
||||
That's what the `y` flag does. It makes the regexp search only at the `lastIndex` position.
|
||||
|
||||
Here's an example:
|
||||
|
||||
```js run
|
||||
let str = "(text before) function ...";
|
||||
|
||||
*!*
|
||||
let regexp = /function/y;
|
||||
regexp.lastIndex = 5;
|
||||
*/!*
|
||||
|
||||
alert (regexp.exec(str)); // null (no match, unlike "g" flag!)
|
||||
|
||||
*!*
|
||||
regexp.lastIndex = 14;
|
||||
*/!*
|
||||
|
||||
alert (regexp.exec(str)); // function (match!)
|
||||
```
|
||||
|
||||
As we can see, now the regexp is only matched at the given position.
|
||||
|
||||
So what `y` does is truly unique, and very important for writing parsers.
|
||||
|
||||
The `y` flag allows us to apply a regular expression (or many of them, one by one) exactly at the given position. Once we understand what's there, we can move on, examining the text step by step.
|
||||
|
||||
Without the flag, the regexp engine always searches till the end of the text. That takes time, especially if the text is large, so our parser would be very slow. The `y` flag is exactly the right thing here.
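To make the idea more concrete, here is a minimal tokenizer sketch built on sticky regexps. The token table, its type names and the `tokenize` function are assumptions made up for this illustration, not a real parser:

```js
// Hypothetical token table: names and patterns are illustrative only.
// Each regexp has the "y" flag, so it matches only at lastIndex.
const tokenTypes = [
  { type: "space",    regexp: /\s+/y },
  { type: "number",   regexp: /\d+/y },
  { type: "name",     regexp: /[a-zA-Z_]\w*/y },
  { type: "operator", regexp: /[+\-/*]/y },
];

function tokenize(str) {
  const tokens = [];
  let pos = 0;

  while (pos < str.length) {
    let found = false;

    for (const { type, regexp } of tokenTypes) {
      regexp.lastIndex = pos; // check exactly at pos, nowhere else
      const match = regexp.exec(str);
      if (match) {
        if (type !== "space") tokens.push({ type, value: match[0] });
        pos = regexp.lastIndex; // exec moved lastIndex past the match
        found = true;
        break;
      }
    }

    // nothing in the expected list matched at this position
    if (!found) throw new SyntaxError(`Unexpected character at position ${pos}`);
  }

  return tokens;
}

const tokens = tokenize("total + 42*x");
console.log(tokens.map(t => `${t.type}:${t.value}`).join(" "));
// name:total operator:+ number:42 operator:* name:x
```

Every regexp is tried only at the current position, so a failed attempt costs very little, and the text is never rescanned from an earlier point.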
|
3
9-regular-expressions/index.md
Normal file
|
@ -0,0 +1,3 @@
|
|||
# Regular expressions
|
||||
|
||||
Regular expressions are a powerful way of doing search and replace in strings.
|