components

This commit is contained in:
Ilya Kantor 2019-04-02 14:01:44 +03:00
parent 304d578b54
commit 6fb4aabcba
344 changed files with 669 additions and 406 deletions

@ -0,0 +1,127 @@
# Patterns and flags
Regular expressions are a powerful way of searching and replacing inside a string.
In JavaScript, regular expressions are implemented as objects of the built-in `RegExp` class and are integrated with strings.
Please note that regular expressions vary between programming languages. In this tutorial we concentrate on JavaScript. Of course there's a lot in common, but they are somewhat different in Perl, Ruby, PHP etc.
## Regular expressions
A regular expression (also "regexp", or just "reg") consists of a *pattern* and optional *flags*.
There are two syntaxes to create a regular expression object.
The long syntax:
```js
regexp = new RegExp("pattern", "flags");
```
...And the short one, using slashes `"/"`:
```js
regexp = /pattern/; // no flags
regexp = /pattern/gmi; // with flags g,m and i (to be covered soon)
```
Slashes `"/"` tell JavaScript that we are creating a regular expression. They play the same role as quotes for strings.
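To make the equivalence of the two syntaxes concrete, here is a minimal sketch (the variable names are only for illustration) that builds the same regexp both ways and compares the results through the standard `source` and `flags` properties:

```js
// The same regular expression, created with each syntax.
const viaLiteral = /love/gi;
const viaConstructor = new RegExp("love", "gi");

// `source` holds the pattern text, `flags` holds the flags as a string.
const samePattern = viaLiteral.source === viaConstructor.source;
const sameFlags = viaLiteral.flags === viaConstructor.flags;

console.log(samePattern, sameFlags); // true true
console.log(viaConstructor instanceof RegExp); // true
```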
## Usage
To search inside a string, we can use method [search](mdn:js/String/search).
Here's an example:
```js run
let str = "I love JavaScript!"; // will search here
let regexp = /love/;
alert( str.search(regexp) ); // 2
```
The `str.search` method looks for the pattern `pattern:/love/` and returns the position inside the string. As we might guess, `pattern:/love/` is the simplest possible pattern. What it does is a simple substring search.
The code above is the same as:
```js run
let str = "I love JavaScript!"; // will search here
let substr = 'love';
alert( str.search(substr) ); // 2
```
So searching for `pattern:/love/` is the same as searching for `"love"`.
But that's only for now. Soon we'll create more complex regular expressions with much more searching power.
```smart header="Colors"
From here on the color scheme is:
- regexp -- `pattern:red`
- string (where we search) -- `subject:blue`
- result -- `match:green`
```
````smart header="When to use `new RegExp`?"
Normally we use the short syntax `/.../`. But it does not allow any variable insertions, so we must know the exact regexp at the time of writing the code.
On the other hand, `new RegExp` allows us to construct a pattern dynamically from a string.
So we can figure out what we need to search and create `new RegExp` from it:
```js run
let search = prompt("What you want to search?", "love");
let regexp = new RegExp(search);
// find whatever the user wants
alert( "I love JavaScript".search(regexp));
```
````
## Flags
Regular expressions may have flags that affect the search.
There are only 6 of them in JavaScript:
`i`
: With this flag the search is case-insensitive: no difference between `A` and `a` (see the example below).
`g`
: With this flag the search looks for all matches, without it -- only the first one (we'll see uses in the next chapter).
`m`
: Multiline mode (covered in the chapter <info:regexp-multiline>).
`s`
: "Dotall" mode, allows `.` to match newlines (covered in the chapter <info:regexp-character-classes>).
`u`
: Enables full unicode support. The flag enables correct processing of surrogate pairs. More about that in the chapter <info:regexp-unicode>.
`y`
: Sticky mode (covered in the chapter <info:regexp-sticky>)
We'll cover all these flags further in the tutorial.
For now, the simplest flag is `i`, here's an example:
```js run
let str = "I love JavaScript!";
alert( str.search(/LOVE/i) ); // 2 (found lowercased)
alert( str.search(/LOVE/) ); // -1 (nothing found without 'i' flag)
```
So the `i` flag already makes regular expressions more powerful than a simple substring search. But there's so much more. We'll cover other flags and features in the next chapters.
## Summary
- A regular expression consists of a pattern and optional flags: `g`, `i`, `m`, `u`, `s`, `y`.
- Without flags and special symbols that we'll study later, the search by a regexp is the same as a substring search.
- The method `str.search(regexp)` returns the index where the match is found or `-1` if there's no match. In the next chapter we'll see other methods.

@ -0,0 +1,458 @@
# Methods of RegExp and String
There are two sets of methods to deal with regular expressions.
1. First, regular expressions are objects of the built-in [RegExp](mdn:js/RegExp) class, which provides many methods.
2. Besides that, there are methods of regular strings that can work with regexps.
## Recipes
Which method to use depends on what we'd like to do.
Methods become much easier to understand if we separate them by their use in real-life tasks:
**To search for all matches:**
Use regexp `g` flag and:
- Get a flat array of matches -- `str.match(reg)`
- Get an iterable of matches with details -- `str.matchAll(reg)`.
**To search for the first match only:**
- Get the full first match -- `str.match(reg)` (without `g` flag).
- Get the string position of the first match -- `str.search(reg)`.
- Check if there's a match -- `regexp.test(str)`.
- Find the match from the given position -- `regexp.exec(str)` (set `regexp.lastIndex` to position).
**To replace all matches:**
- Replace with another string or a function result -- `str.replace(reg, str|func)`
**To split the string by a separator:**
- `str.split(str|reg)`
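As a quick orientation, the recipes above can be sketched on a single string. This is only an illustrative tour (the sample string and variable names are made up for the example), not a replacement for the detailed sections:

```js
const str = "I love JavaScript, JAVA and java";

// All matches (the regexp needs the g flag)
const allFlat = str.match(/java/gi);                    // ["Java", "JAVA", "java"]
const allDetailed = Array.from(str.matchAll(/java/gi)); // match objects with .index

// First match only
const first = str.match(/java/i);     // first[0] is "Java", first.index is 7
const position = str.search(/java/i); // 7
const found = /java/i.test(str);      // true

// Replace all matches, split by a separator
const replaced = str.replace(/java/gi, "X"); // "I love XScript, X and X"
const parts = str.split(", ");               // ["I love JavaScript", "JAVA and java"]

console.log(allFlat, position, found, replaced, parts);
```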
The rest of this chapter covers every method in detail. But if you're reading for the first time and mostly want to learn more about regexps themselves, go ahead!
You may want to skip the methods for now, move on to the next chapter, and then return here if something about a method is unclear.
## str.search(reg)
We've seen this method already. It returns the position of the first match or `-1` if none found:
```js run
let str = "A drop of ink may make a million think";
alert( str.search( *!*/a/i*/!* ) ); // 0 (the first position)
```
**The important limitation: `search` only finds the first match.**
We can't find next positions using `search`, there's just no syntax for that. But there are other methods that can.
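For comparison, here is a minimal sketch of how the positions of all matches could be collected with `matchAll` (described later in this chapter). It assumes an environment where `matchAll` is available:

```js
const str = "A drop of ink may make a million think";

// matchAll needs the g flag; every match object carries its own index.
const positions = Array.from(str.matchAll(/a/gi), (match) => match.index);

console.log(positions); // [0, 15, 19, 23]
```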
## str.match(reg), no "g" flag
The behavior of `str.match` varies depending on whether `reg` has `g` flag or not.
First, if there's no `g` flag, then `str.match(reg)` looks for the first match only.
The result is an array with that match and additional properties:
- `index` -- the position of the match inside the string,
- `input` -- the subject string.
For instance:
```js run
let str = "Fame is the thirst of youth";
let result = str.match( *!*/fame/i*/!* );
alert( result[0] ); // Fame (the match)
alert( result.index ); // 0 (at the zero position)
alert( result.input ); // "Fame is the thirst of youth" (the string)
```
A match result may have more than one element.
**If a part of the pattern is delimited by parentheses `(...)`, then it becomes a separate element in the array.**
If parentheses have a name, designated by `(?<name>...)` at their start, then `result.groups[name]` has the content. We'll see that later in the chapter [about groups](info:regexp-groups).
For instance:
```js run
let str = "JavaScript is a programming language";
let result = str.match( *!*/JAVA(SCRIPT)/i*/!* );
alert( result[0] ); // JavaScript (the whole match)
alert( result[1] ); // script (the part of the match that corresponds to the parentheses)
alert( result.index ); // 0
alert( result.input ); // JavaScript is a programming language
```
Due to the `i` flag the search is case-insensitive, so it finds `match:JavaScript`. The part of the match that corresponds to `pattern:SCRIPT` becomes a separate array item.
So, this method is used to find one full match with all details.
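Named parentheses were mentioned above without an example. A small sketch, assuming an engine that supports named groups (the sample string and the group names `first` and `last` are made up for illustration):

```js
const str = "Author: John Smith";
const result = str.match(/(?<first>John) (?<last>Smith)/);

console.log(result[0]);           // John Smith (the whole match)
console.log(result[1]);           // John (named groups are still numbered)
console.log(result.groups.first); // John
console.log(result.groups.last);  // Smith
console.log(result.index);        // 8
```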
## str.match(reg) with "g" flag
When there's a `"g"` flag, then `str.match` returns an array of all matches. There are no additional properties in that array, and parentheses do not create any elements.
For instance:
```js run
let str = "HO-Ho-ho!";
let result = str.match( *!*/ho/ig*/!* );
alert( result ); // HO, Ho, ho (array of 3 matches, case-insensitive)
```
Parentheses do not change anything, here we go:
```js run
let str = "HO-Ho-ho!";
let result = str.match( *!*/h(o)/ig*/!* );
alert( result ); // HO, Ho, ho
```
**So, with `g` flag `str.match` returns a simple array of all matches, without details.**
If we want to get information about match positions and contents of parentheses then we should use `matchAll` method that we'll cover below.
````warn header="If there are no matches, `str.match` returns `null`"
Please note, that's important. If there are no matches, the result is not an empty array, but `null`.
Keep that in mind to avoid pitfalls like this:
```js run
let str = "Hey-hey-hey!";
alert( str.match(/Z/g).length ); // Error: Cannot read property 'length' of null
```
Here `str.match(/Z/g)` is `null`, it has no `length` property.
````
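One common way to guard against the `null` result described in the warning above is to fall back to an empty array. A minimal sketch:

```js
const str = "Hey-hey-hey!";

// If match returns null, use an empty array instead.
const noMatches = str.match(/Z/g) || [];
const heyMatches = str.match(/hey/gi) || [];

console.log(noMatches.length);  // 0
console.log(heyMatches.length); // 3
```

The `|| []` fallback keeps later code such as `.length` or `.join()` safe whether or not anything was found.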
## str.matchAll(regexp)
The method `str.matchAll(regexp)` is used to find all matches with all details.
For instance:
```js run
let str = "Javascript or JavaScript? Should we uppercase 'S'?";
let result = str.matchAll( *!*/java(script)/ig*/!* );
let [match1, match2] = result;
alert( match1[0] ); // Javascript (the whole match)
alert( match1[1] ); // script (the part of the match that corresponds to the parentheses)
alert( match1.index ); // 0
alert( match1.input ); // = str (the whole original string)
alert( match2[0] ); // JavaScript (the whole match)
alert( match2[1] ); // Script (the part of the match that corresponds to the parentheses)
alert( match2.index ); // 14
alert( match2.input ); // = str (the whole original string)
```
````warn header="`matchAll` returns an iterable, not array"
For instance, if we try to get the first match by index, it won't work:
```js run
let str = "Javascript or JavaScript??";
let result = str.matchAll( /javascript/ig );
*!*
alert(result[0]); // undefined (?! there must be a match)
*/!*
```
The reason is that the iterator is not an array. We need to run `Array.from(result)` on it, or use `for..of` loop to get matches.
In practice, if we need all matches, then `for..of` works, so it's not a problem.
And, to get only a few matches, we can use destructuring:
```js run
let str = "Javascript or JavaScript??";
*!*
let [firstMatch] = str.matchAll( /javascript/ig );
*/!*
alert(firstMatch); // Javascript
```
````
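Both ways of consuming the iterable mentioned in the warning above, `Array.from` and `for..of`, can be sketched like this (assuming `matchAll` is available in the environment):

```js
const str = "Javascript or JavaScript??";

// 1. Convert the iterable into a real array, then use indexes.
const asArray = Array.from(str.matchAll(/javascript/gi));
console.log(asArray.length); // 2
console.log(asArray[0][0]);  // Javascript

// 2. Loop over the matches directly.
const found = [];
for (const match of str.matchAll(/javascript/gi)) {
  found.push(`${match[0]} at ${match.index}`);
}
console.log(found); // ["Javascript at 0", "JavaScript at 14"]
```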
```warn header="`matchAll` is supernew, may need a polyfill"
The method may not work in old browsers. A polyfill might be needed (this site uses core-js).
Or you could make a loop with `regexp.exec`, explained below.
```
## str.split(regexp|substr, limit)
Splits the string using the regexp (or a substring) as a delimiter.
We already used `split` with strings, like this:
```js run
alert('12-34-56'.split('-')) // array of [12, 34, 56]
```
But we can split by a regular expression, the same way:
```js run
alert('12-34-56'.split(/-/)) // array of [12, 34, 56]
```
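A regexp becomes useful for `split` when there is more than one possible delimiter. The sketch below uses a set `[-:]` (sets are covered in a later chapter) and also shows the optional `limit` argument from the heading; the sample strings are made up for illustration:

```js
// Split by either a dash or a colon.
const mixed = "12-34:56".split(/[-:]/);
console.log(mixed); // ["12", "34", "56"]

// The second argument limits the number of returned items.
const limited = "12-34-56".split(/-/, 2);
console.log(limited); // ["12", "34"]
```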
## str.replace(str|reg, str|func)
That's actually a great method, one of the most useful ones. It's the Swiss army knife for searching and replacing.
The simplest use -- searching and replacing a substring, like this:
```js run
// replace a dash by a colon
alert('12-34-56'.replace("-", ":")) // 12:34-56
```
There's a pitfall though.
**When the first argument of `replace` is a string, it only looks for the first match.**
You can see that in the example above: only the first `"-"` is replaced by `":"`.
To find all dashes, we need to use not the string `"-"`, but a regexp `pattern:/-/g`, with an obligatory `g` flag:
```js run
// replace all dashes by a colon
alert( '12-34-56'.replace( *!*/-/g*/!*, ":" ) ) // 12:34:56
```
The second argument is a replacement string. We can use special characters in it:
| Symbol | Inserts |
|--------|--------|
|`$$`|`"$"` |
|`$&`|the whole match|
|<code>$&#096;</code>|a part of the string before the match|
|`$'`|a part of the string after the match|
|`$n`|if `n` is a 1-2 digit number, the contents of the n-th parentheses, counting from left to right|
|`$<name>`|the contents of the parentheses with the given `name`|
For instance if we use `$&` in the replacement string, that means "put the whole match here".
Let's use it to prepend all entries of `"John"` with `"Mr."`:
```js run
let str = "John Doe, John Smith and John Bull";
// for each John - replace it with Mr. and then John
alert(str.replace(/John/g, 'Mr.$&')); // Mr.John Doe, Mr.John Smith and Mr.John Bull
```
Quite often we'd like to reuse parts of the source string, recombine them in the replacement or wrap into something.
To do so, we should:
1. First, mark the parts by parentheses in regexp.
2. Use `$1`, `$2` (and so on) in the replacement string to get the content matched by parentheses.
For instance:
```js run
let str = "John Smith";
// swap first and last name
alert(str.replace(/(john) (smith)/i, '$2, $1')) // Smith, John
```
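The less common special sequences in the replacement string can be sketched in the same way. The sample strings are made up for illustration; named parentheses are referenced in a replacement string as `$<name>`:

```js
// "$$" inserts a literal dollar sign, "$&" inserts the whole match.
const price = "price: 5".replace(/\d/, "$$$&");
console.log(price); // price: $5

// "$`" inserts the part of the string before the match.
const before = "a-b-c".replace(/b/, "[$`]");
console.log(before); // a-[a-]-c

// "$'" inserts the part of the string after the match.
const after = "a-b-c".replace(/b/, "[$']");
console.log(after); // a-[-c]-c

// "$<name>" inserts the contents of the named parentheses.
const swapped = "John Smith".replace(/(?<first>John) (?<last>Smith)/, "$<last>, $<first>");
console.log(swapped); // Smith, John
```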
**For situations that require "smart" replacements, the second argument can be a function.**
It will be called for each match, and its result will be inserted as a replacement.
For instance:
```js run
let i = 0;
// replace each "ho" by the result of the function
alert("HO-Ho-ho".replace(/ho/gi, function() {
return ++i;
})); // 1-2-3
```
In the example above the function just returns the next number every time, but usually the result is based on the match.
The function is called with arguments `func(str, p1, p2, ..., pn, offset, input, groups)`:
1. `str` -- the match,
2. `p1, p2, ..., pn` -- contents of parentheses (if there are any),
3. `offset` -- position of the match,
4. `input` -- the source string,
5. `groups` -- an object with named groups (see chapter [](info:regexp-groups)).
If there are no parentheses in the regexp, then there are only 3 arguments: `func(str, offset, input)`.
Let's use it to show full information about matches:
```js run
// show and replace all matches
function replacer(str, offset, input) {
alert(`Found ${str} at position ${offset} in string ${input}`);
return str.toLowerCase();
}
let result = "HO-Ho-ho".replace(/ho/gi, replacer);
alert( 'Result: ' + result ); // Result: ho-ho-ho
// shows each match:
// Found HO at position 0 in string HO-Ho-ho
// Found Ho at position 3 in string HO-Ho-ho
// Found ho at position 6 in string HO-Ho-ho
```
In the example below there are two parentheses, so `replacer` is called with 5 arguments: `str` is the full match, then parentheses, and then `offset` and `input`:
```js run
function replacer(str, name, surname, offset, input) {
// name is the first parentheses, surname is the second one
return surname + ", " + name;
}
let str = "John Smith";
alert(str.replace(/(John) (Smith)/, replacer)) // Smith, John
```
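As an example of a result that is based on the match, here is a sketch of a hypothetical helper `camelize` (not a built-in) that turns dash-separated names into camelCase by uppercasing the character captured in parentheses:

```js
// Hypothetical helper: "border-left-width" -> "borderLeftWidth"
function camelize(name) {
  // match is "-l", "-w", ...; letter is the contents of the parentheses
  return name.replace(/-(\w)/g, (match, letter) => letter.toUpperCase());
}

console.log(camelize("border-left-width")); // borderLeftWidth
console.log(camelize("color"));             // color (nothing to replace)
```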
Using a function gives us the ultimate replacement power, because it gets all the information about the match, has access to outer variables and can do everything.
## regexp.exec(str)
We've already seen these searching methods:
- `search` -- looks for the position of the match,
- `match` -- if there's no `g` flag, returns the first match with parentheses and all details,
- `match` -- if there's a `g` flag, returns all matches, without details or parentheses,
- `matchAll` -- returns all matches with details.
The `regexp.exec` method is the most flexible searching method of all. Unlike previous methods, `exec` should be called on a regexp, rather than on a string.
It behaves differently depending on whether the regexp has the `g` flag.
If there's no `g`, then `regexp.exec(str)` returns the first match, exactly as `str.match(reg)`. Such behavior does not give us anything new.
But if there's `g`, then:
- `regexp.exec(str)` returns the first match and *remembers* the position after it in `regexp.lastIndex` property.
- The next call starts to search from `regexp.lastIndex` and returns the next match.
- If there are no more matches then `regexp.exec` returns `null` and `regexp.lastIndex` is set to `0`.
We could use it to get all matches with their positions and parentheses groups in a loop, instead of `matchAll`:
```js run
let str = 'A lot about JavaScript at https://javascript.info';
let regexp = /javascript/ig;
let result;
while (result = regexp.exec(str)) {
alert( `Found ${result[0]} at ${result.index}` );
// shows: Found JavaScript at 12, then:
// shows: Found javascript at 34
}
```
Of course, `matchAll` does the same, at least in modern browsers. But what `matchAll` can't do is search from a given position.
Let's search from position `13`. What we need is to assign `regexp.lastIndex=13` and call `regexp.exec`:
```js run
let str = "A lot about JavaScript at https://javascript.info";
let regexp = /javascript/ig;
*!*
regexp.lastIndex = 13;
*/!*
let result;
while (result = regexp.exec(str)) {
alert( `Found ${result[0]} at ${result.index}` );
// shows: Found javascript at 34
}
```
Now, starting from the given position `13`, there's only one match.
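The same idea can be wrapped into a small function. The helper below, `firstMatchFrom`, is hypothetical (it is not part of the language); it works on a copy of the regexp so that the caller's `lastIndex` stays untouched, and it adds the `g` flag if it's missing, because `lastIndex` is ignored without it:

```js
// Hypothetical helper: find the first match at or after the given position.
function firstMatchFrom(regexp, str, position) {
  const flags = regexp.flags.includes("g") ? regexp.flags : regexp.flags + "g";
  const copy = new RegExp(regexp.source, flags); // don't disturb the original
  copy.lastIndex = position;
  return copy.exec(str); // a match array or null
}

const str = "A lot about JavaScript at https://javascript.info";

const fromStart = firstMatchFrom(/javascript/i, str, 0);
const from13 = firstMatchFrom(/javascript/i, str, 13);
const from40 = firstMatchFrom(/javascript/i, str, 40);

console.log(fromStart.index); // 12
console.log(from13.index);    // 34
console.log(from40);          // null
```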
## regexp.test(str)
The method `regexp.test(str)` looks for a match and returns `true` or `false` depending on whether it finds one.
For instance:
```js run
let str = "I love JavaScript";
// these two tests do the same
alert( *!*/love/i*/!*.test(str) ); // true
alert( str.search(*!*/love/i*/!*) != -1 ); // true
```
An example with the negative answer:
```js run
let str = "Bla-bla-bla";
alert( *!*/love/i*/!*.test(str) ); // false
alert( str.search(*!*/love/i*/!*) != -1 ); // false
```
If the regexp has `'g'` flag, then `regexp.test` advances `regexp.lastIndex` property, just like `regexp.exec`.
So we can use it to search from a given position:
```js run
let regexp = /love/gi;
let str = "I love JavaScript";
// start the search from position 10:
regexp.lastIndex = 10
alert( regexp.test(str) ); // false (no match)
```
````warn header="Same global regexp tested repeatedly may fail to match"
If we apply the same global regexp to different inputs, it may lead to a wrong result, because a `regexp.test` call advances the `regexp.lastIndex` property, so the next search starts from a non-zero position.
For instance, here we call `regexp.test` twice on the same text, and the second time fails:
```js run
let regexp = /javascript/g; // (regexp just created: regexp.lastIndex=0)
alert( regexp.test("javascript") ); // true (regexp.lastIndex=10 now)
alert( regexp.test("javascript") ); // false
```
That's exactly because `regexp.lastIndex` is non-zero on the second test.
To work around that, one could use non-global regexps or re-adjust `regexp.lastIndex=0` before a new search.
````
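The workaround from the warning above, resetting `regexp.lastIndex` before each new search, can be sketched as follows (the `inputs` array is made up for illustration):

```js
const regexp = /javascript/g;
const inputs = ["javascript", "javascript", "javascript"];

// Naive reuse: every second test starts from lastIndex=10 and fails.
const naive = inputs.map((input) => regexp.test(input));
console.log(naive); // [true, false, true]

// Resetting lastIndex before each test gives the expected answers.
const withReset = inputs.map((input) => {
  regexp.lastIndex = 0;
  return regexp.test(input);
});
console.log(withReset); // [true, true, true]
```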
## Summary
There is a variety of methods on both regexps and strings.
Their abilities overlap quite a bit: we can do the same thing with different calls. Sometimes that may cause confusion when starting to learn the language.
In that case, please refer to the recipes at the beginning of this chapter, as they provide solutions for the majority of regexp-related tasks.

@ -0,0 +1,6 @@
The answer: `pattern:\b\d\d:\d\d\b`.
```js run
alert( "Breakfast at 09:00 in the room 123:456.".match( /\b\d\d:\d\d\b/ ) ); // 09:00
```

@ -0,0 +1,8 @@
# Find the time
The time has a format: `hours:minutes`. Both hours and minutes have two digits, like `09:00`.
Make a regexp to find time in the string: `subject:Breakfast at 09:00 in the room 123:456.`
P.S. In this task there's no need to check time correctness yet, so `25:99` can also be a valid result.
P.P.S. The regexp shouldn't match `123:456`.

@ -0,0 +1,265 @@
# Character classes
Consider a practical task -- we have a phone number `"+7(903)-123-45-67"`, and we need to turn it into pure numbers: `79031234567`.
To do so, we can find and remove anything that's not a number. Character classes can help with that.
A character class is a special notation that matches any symbol from a certain set.
To start, let's explore the "digit" class. It's written as `\d`. When we put it in a pattern, it means "any single digit".
For instance, let's find the first digit in the phone number:
```js run
let str = "+7(903)-123-45-67";
let reg = /\d/;
alert( str.match(reg) ); // 7
```
Without the flag `g`, the regular expression only looks for the first match, that is the first digit `\d`.
Let's add the `g` flag to find all digits:
```js run
let str = "+7(903)-123-45-67";
let reg = /\d/g;
alert( str.match(reg) ); // array of matches: 7,9,0,3,1,2,3,4,5,6,7
alert( str.match(reg).join('') ); // 79031234567
```
That was a character class for digits. There are other character classes as well.
Most used are:
`\d` ("d" is from "digit")
: A digit: a character from `0` to `9`.
`\s` ("s" is from "space")
: A space symbol: that includes spaces, tabs, newlines.
`\w` ("w" is from "word")
: A "wordly" character: either a letter of the English alphabet or a digit or an underscore. Non-English letters (like Cyrillic or Hindi) do not belong to `\w`.
For instance, `pattern:\d\s\w` means a "digit" followed by a "space character" followed by a "wordly character", like `"1 a"`.
**A regexp may contain both regular symbols and character classes.**
For instance, `pattern:CSS\d` matches a string `match:CSS` with a digit after it:
```js run
let str = "CSS4 is cool";
let reg = /CSS\d/
alert( str.match(reg) ); // CSS4
```
Also we can use many character classes:
```js run
alert( "I love HTML5!".match(/\s\w\w\w\w\d/) ); // ' HTML5' (the leading space is matched by \s)
```
The match (each character class corresponds to one result character):
![](love-html5-classes.png)
## Word boundary: \b
A word boundary `pattern:\b` -- is a special character class.
It does not denote a character, but rather a boundary between characters.
For instance, `pattern:\bJava\b` matches `match:Java` in the string `subject:Hello, Java!`, but not in the string `subject:Hello, JavaScript!`.
```js run
alert( "Hello, Java!".match(/\bJava\b/) ); // Java
alert( "Hello, JavaScript!".match(/\bJava\b/) ); // null
```
The boundary has "zero width": a character class usually corresponds to a character in the result (like a wordly character or a digit), but `pattern:\b` does not.
The boundary is a test.
When the regular expression engine is doing the search, it moves along the string in an attempt to find the match. At each string position it tries to find the pattern.
When the pattern contains `pattern:\b`, it tests that the position in the string is a word boundary, that is, one of three variants:
- Immediately before is `\w`, and immediately after -- not `\w`, or vice versa.
- At string start, and the first string character is `\w`.
- At string end, and the last string character is `\w`.
For instance, in the string `subject:Hello, Java!` the following positions match `\b`:
![](hello-java-boundaries.png)
So it matches `pattern:\bHello\b`, because:
1. At the beginning of the string the first `\b` test matches.
2. Then the word `Hello` matches.
3. Then `\b` matches, as we're between `o` and a space.
The pattern `pattern:\bJava\b` also matches. But not `pattern:\bHell\b` (because there's no word boundary after `l`) and not `pattern:Java!\b` (because the exclamation sign is not a wordly character, so there's no word boundary after it).
```js run
alert( "Hello, Java!".match(/\bHello\b/) ); // Hello
alert( "Hello, Java!".match(/\bJava\b/) ); // Java
alert( "Hello, Java!".match(/\bHell\b/) ); // null (no match)
alert( "Hello, Java!".match(/\bJava!\b/) ); // null (no match)
```
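One way to see where the boundaries are is to insert a visible marker at every position where `pattern:\b` matches. This is only an illustrative trick (the `|` marker is an arbitrary choice):

```js
// \b matches positions, not characters, so replace inserts the marker
// without removing anything from the string.
const marked = "Hello, Java!".replace(/\b/g, "|");

console.log(marked); // |Hello|, |Java|!
```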
Once again, note that `pattern:\b` only makes the search engine test for a boundary: `pattern:Java\b` finds `match:Java` only when it is followed by a word boundary, but the boundary does not add a character to the result.
Usually we use `\b` to find standalone English words. So if we want the `"Java"` language, then `pattern:\bJava\b` finds exactly the standalone word and ignores it when it's a part of `"JavaScript"`.
Another example: the regexp `pattern:\b\d\d\b` looks for standalone two-digit numbers. In other words, it requires that the characters before and after `pattern:\d\d` are different from `\w` (or that there is the beginning/end of the string).
```js run
alert( "1 23 456 78".match(/\b\d\d\b/g) ); // 23,78
```
```warn header="Word boundary doesn't work for non-English alphabets"
The word boundary check `\b` tests for a boundary between `\w` and something else. But `\w` means an English letter (or a digit or an underscore), so the test won't work for other characters (like Cyrillic or hieroglyphs).
```
## Inverse classes
For every character class there exists an "inverse class", denoted with the same letter, but uppercased.
"Inverse" means that it matches all other characters, for instance:
`\D`
: Non-digit: any character except `\d`, for instance a letter.
`\S`
: Non-space: any character except `\s`, for instance a letter.
`\W`
: Non-wordly character: anything but `\w`.
`\B`
: Non-boundary: a test reverse to `\b`.
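A short sketch of each inverse class in action (the sample strings are made up for illustration):

```js
// \D: everything that is not a digit
const nonDigits = "+7(903)-123".match(/\D/g);
console.log(nonDigits); // ["+", "(", ")", "-"]

// \S: everything that is not a space symbol
const masked = "a b\tc".replace(/\S/g, "x");
console.log(masked === "x x\tx"); // true

// \W: everything that is not a wordly character
const nonWordly = "Hello, Java!".match(/\W/g);
console.log(nonWordly); // [",", " ", "!"]

// \B: positions that are NOT word boundaries (here: between the letters)
const spelled = "Hello".replace(/\B/g, "-");
console.log(spelled); // H-e-l-l-o
```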
In the beginning of the chapter we saw how to get all digits from the phone `subject:+7(903)-123-45-67`.
One way was to match all digits and join them:
```js run
let str = "+7(903)-123-45-67";
alert( str.match(/\d/g).join('') ); // 79031234567
```
An alternative, shorter way is to find non-digits `\D` and remove them from the string:
```js run
let str = "+7(903)-123-45-67";
alert( str.replace(/\D/g, "") ); // 79031234567
```
## Spaces are regular characters
Usually we pay little attention to spaces. For us strings `subject:1-5` and `subject:1 - 5` are nearly identical.
But if a regexp doesn't take spaces into account, it may fail to work.
Let's try to find digits separated by a dash:
```js run
alert( "1 - 5".match(/\d-\d/) ); // null, no match!
```
Here we fix it by adding spaces into the regexp `pattern:\d - \d`:
```js run
alert( "1 - 5".match(/\d - \d/) ); // 1 - 5, now it works
```
**A space is a character. Equal in importance with any other character.**
Of course, spaces in a regexp are needed only if we look for them. Extra spaces (just like any other extra characters) may prevent a match:
```js run
alert( "1-5".match(/\d - \d/) ); // null, because the string 1-5 has no spaces
```
In other words, in a regular expression all characters matter, spaces too.
## A dot is any character
The dot `"."` is a special character class that matches "any character except a newline".
For instance:
```js run
alert( "Z".match(/./) ); // Z
```
Or in the middle of a regexp:
```js run
let reg = /CS.4/;
alert( "CSS4".match(reg) ); // CSS4
alert( "CS-4".match(reg) ); // CS-4
alert( "CS 4".match(reg) ); // CS 4 (space is also a character)
```
Please note that the dot means "any character", but not the "absence of a character". There must be a character to match it:
```js run
alert( "CS4".match(/CS.4/) ); // null, no match because there's no character for the dot
```
### The dotall "s" flag
Usually a dot doesn't match a newline character.
For instance, this doesn't match:
```js run
alert( "A\nB".match(/A.B/) ); // null (no match)
// a space character would match
// or a letter, but not \n
```
Sometimes that's inconvenient: we really want "any character", newline included.
That's what the `s` flag does. If a regexp has it, then the dot `"."` matches literally any character:
```js run
alert( "A\nB".match(/A.B/s) ); // A\nB (match!)
```
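In environments where the `s` flag is not supported, a widely used alternative is the set `[\s\S]`, which means "a space symbol or a non-space symbol", that is, any character at all. A minimal sketch:

```js
// [\s\S] matches any character, including a newline, without the s flag.
const result = "A\nB".match(/A[\s\S]B/);

console.log(result !== null);      // true
console.log(result[0] === "A\nB"); // true
```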
## Summary
The following character classes exist:
- `pattern:\d` -- digits.
- `pattern:\D` -- non-digits.
- `pattern:\s` -- space symbols, tabs, newlines.
- `pattern:\S` -- all but `pattern:\s`.
- `pattern:\w` -- English letters, digits, underscore `'_'`.
- `pattern:\W` -- all but `pattern:\w`.
- `pattern:.` -- any character if with the regexp `'s'` flag, otherwise any except a newline.
...But that's not all!
Modern JavaScript also allows us to look for characters by their Unicode properties, for instance:
- A Cyrillic letter is: `pattern:\p{Script=Cyrillic}` or `pattern:\p{sc=Cyrillic}`.
- A dash (be it a small hyphen `-` or a long dash `—`): `pattern:\p{Dash_Punctuation}` or `pattern:\p{Pd}`.
- A currency symbol: `pattern:\p{Currency_Symbol}` or `pattern:\p{Sc}`.
- ...And much more. Unicode has a lot of character categories that we can select from.
These patterns require `'u'` regexp flag to work. More about that in the chapter [](info:regexp-unicode).
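A brief sketch of these Unicode property patterns, assuming an engine that supports them (they were added to the language in ES2018). The sample strings are made up for illustration; note that the short category names are case-sensitive:

```js
// Cyrillic letters
const cyrillic = "Привет, world".match(/\p{Script=Cyrillic}/gu).join("");
console.log(cyrillic); // Привет

// Currency symbols
const currency = "$5, €7, ¥9".match(/\p{Sc}/gu);
console.log(currency); // ["$", "€", "¥"]

// Dashes: a hyphen and a long dash (written here as \u2014)
const dashCount = "a-b\u2014c".match(/\p{Pd}/gu).length;
console.log(dashCount); // 2
```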

@ -0,0 +1,99 @@
# Escaping, special characters
As we've seen, a backslash `"\"` is used to denote character classes. So it's a special character in regexps (just like in a regular string).
There are other special characters as well, that have special meaning in a regexp. They are used to do more powerful searches. Here's a full list of them: `pattern:[ \ ^ $ . | ? * + ( )`.
Don't try to remember the list -- soon we'll deal with each of them separately and you'll know them by heart automatically.
## Escaping
Let's say we want to find a dot literally. Not "any character", but just a dot.
To use a special character as a regular one, prepend it with a backslash: `pattern:\.`.
That's also called "escaping a character".
For example:
```js run
alert( "Chapter 5.1".match(/\d\.\d/) ); // 5.1 (match!)
alert( "Chapter 511".match(/\d\.\d/) ); // null (looking for a real dot \.)
```
Parentheses are also special characters, so if we want them, we should use `pattern:\(`. The example below looks for a string `"g()"`:
```js run
alert( "function g()".match(/g\(\)/) ); // "g()"
```
If we're looking for a backslash `\`, it's a special character in both regular strings and regexps, so we should double it.
```js run
alert( "1\\2".match(/\\/) ); // '\'
```
## A slash
A slash symbol `'/'` is not a special character, but in JavaScript it is used to open and close the regexp: `pattern:/...pattern.../`, so we should escape it too.
Here's what a search for a slash `'/'` looks like:
```js run
alert( "/".match(/\//) ); // '/'
```
On the other hand, if we're not using `/.../`, but create a regexp using `new RegExp`, then we don't need to escape it:
```js run
alert( "/".match(new RegExp("/")) ); // '/'
```
## new RegExp
If we are creating a regular expression with `new RegExp`, then we don't have to escape `/`, but need to do some other escaping.
For instance, consider this:
```js run
let reg = new RegExp("\d\.\d");
alert( "Chapter 5.1".match(reg) ); // null
```
It worked with `pattern:/\d\.\d/`, but with `new RegExp("\d\.\d")` it doesn't, why?
The reason is that backslashes are "consumed" by a string. Remember, regular strings have their own special characters like `\n`, and a backslash is used for escaping.
Let's take a look at what `"\d\.\d"` really is:
```js run
alert("\d\.\d"); // d.d
```
The quotes "consume" backslashes and interpret them, for instance:
- `\n` -- becomes a newline character,
- `\u1234` -- becomes the Unicode character with such code,
- ...And when there's no special meaning: like `\d` or `\z`, then the backslash is simply removed.
So the call to `new RegExp` gets a string without backslashes. That's why it doesn't work!
To fix it, we need to double backslashes, because quotes turn `\\` into `\`:
```js run
*!*
let regStr = "\\d\\.\\d";
*/!*
alert(regStr); // \d\.\d (correct now)
let reg = new RegExp(regStr);
alert( "Chapter 5.1".match(reg) ); // 5.1
```
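An alternative to doubling the backslashes by hand is the built-in `String.raw` template tag, which keeps backslashes exactly as written. This is a sketch of an option that the text above does not cover, shown only for comparison:

```js
const doubled = "\\d\\.\\d";     // backslashes doubled by hand
const raw = String.raw`\d\.\d`;  // backslashes kept as written

console.log(doubled === raw); // true

const reg = new RegExp(raw);
console.log("Chapter 5.1".match(reg)[0]); // 5.1
```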
## Summary
- To search special characters `pattern:[ \ ^ $ . | ? * + ( )` literally, we need to prepend them with `\` ("escape them").
- We also need to escape `/` if we're inside `pattern:/.../` (but not inside `new RegExp`).
- When passing a string to `new RegExp`, we need to double backslashes `\\`, because string quotes consume one of them.
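When the pattern comes from user input and should be searched literally, the escaping can be automated. The helper below, `escapeRegExp`, is a hypothetical function written for this sketch (not something this chapter defines); it prepends a backslash to every character that has a special meaning in a regexp, including brackets and braces:

```js
// Hypothetical helper: escape special characters so they are searched literally.
function escapeRegExp(text) {
  // "$&" in the replacement means "the whole match", so each special
  // character is replaced by a backslash followed by itself.
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

console.log(escapeRegExp("g()")); // g\(\)

const literalDot = new RegExp(escapeRegExp("5.1"));
console.log(literalDot.test("Chapter 5.1")); // true
console.log(literalDot.test("Chapter 511")); // false
```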

@ -0,0 +1,12 @@
Answers: **no, yes**.
- In the string `subject:Java` it doesn't match anything, because `pattern:[^script]` means "any character except the given ones". So the regexp looks for `"Java"` followed by one such character, but the string ends right after it, so there is nothing to match.
```js run
alert( "Java".match(/Java[^script]/) ); // null
```
- Yes, because the `pattern:[^script]` part matches the character `"S"`. The regexp is case-sensitive (it has no `i` flag), so `"S"` is not one of the excluded characters `script`.
```js run
alert( "JavaScript".match(/Java[^script]/) ); // "JavaS"
```

@ -0,0 +1,5 @@
# Java[^script]
We have a regexp `pattern:/Java[^script]/`.
Does it match anything in the string `subject:Java`? In the string `subject:JavaScript`?

View file

@ -0,0 +1,8 @@
Answer: `pattern:\d\d[-:]\d\d`.
```js run
let reg = /\d\d[-:]\d\d/g;
alert( "Breakfast at 09:00. Dinner at 21-30".match(reg) ); // 09:00, 21-30
```
Please note that the dash `pattern:'-'` has a special meaning in square brackets, but only between other characters, not when it's in the beginning or at the end, so we don't need to escape it.

View file

@ -0,0 +1,12 @@
# Find the time as hh:mm or hh-mm
The time can be in the format `hours:minutes` or `hours-minutes`. Both hours and minutes have 2 digits: `09:00` or `21-30`.
Write a regexp to find time:
```js
let reg = /your regexp/g;
alert( "Breakfast at 09:00. Dinner at 21-30".match(reg) ); // 09:00, 21-30
```
P.S. In this task we assume that the time is always correct, there's no need to filter out bad strings like "45:67". Later we'll deal with that too.

View file

@ -0,0 +1,114 @@
# Sets and ranges [...]
Several characters or character classes inside square brackets `[…]` mean "search for any character among the given ones".
## Sets
For instance, `pattern:[eao]` means any of the 3 characters: `'a'`, `'e'`, or `'o'`.
That's called a *set*. Sets can be used in a regexp along with regular characters:
```js run
// find [t or m], and then "op"
alert( "Mop top".match(/[tm]op/gi) ); // "Mop", "top"
```
Please note that although there are multiple characters in the set, they correspond to exactly one character in the match.
So the example below gives no matches:
```js run
// find "V", then [o or i], then "la"
alert( "Voila".match(/V[oi]la/) ); // null, no matches
```
The pattern assumes:
- `pattern:V`,
- then *one* of the letters `pattern:[oi]`,
- then `pattern:la`.
So there would be a match for `match:Vola` or `match:Vila`.
## Ranges
Square brackets may also contain *character ranges*.
For instance, `pattern:[a-z]` is a character in range from `a` to `z`, and `pattern:[0-5]` is a digit from `0` to `5`.
In the example below we're searching for `"x"` followed by two digits or letters from `A` to `F`:
```js run
alert( "Exception 0xAF".match(/x[0-9A-F][0-9A-F]/g) ); // xAF
```
Please note that in the word `subject:Exception` there's a substring `subject:xce`. It didn't match the pattern, because the letters are lowercase, while in the set `pattern:[0-9A-F]` they are uppercase.
If we want to find it too, then we can add a range `a-f`: `pattern:[0-9A-Fa-f]`. The `i` flag would allow lowercase too.
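A short sketch of the three variants side by side, using the same string as above (the variable names are only for illustration):

```js
let str = "Exception 0xAF";

let upperOnly = str.match(/x[0-9A-F][0-9A-F]/g);       // uppercase hex letters only
let bothCases = str.match(/x[0-9A-Fa-f][0-9A-Fa-f]/g); // range a-f added
let withFlag = str.match(/x[0-9A-F][0-9A-F]/gi);       // the i flag instead

console.log(upperOnly); // ["xAF"]
console.log(bothCases); // ["xce", "xAF"]
console.log(withFlag);  // ["xce", "xAF"]
```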
**Character classes are shorthands for certain character sets.**
For instance:
- **\d** -- is the same as `pattern:[0-9]`,
- **\w** -- is the same as `pattern:[a-zA-Z0-9_]`,
- **\s** -- is the same as `pattern:[\t\n\v\f\r ]` plus a few other Unicode space characters.
We can use character classes inside `[…]` as well.
For instance, we want to match all word characters or a dash, for words like "twenty-third". We can't do it with `pattern:\w+`, because the `pattern:\w` class does not include a dash. But we can use `pattern:[\w-]`.
We also can use a combination of classes to cover every possible character, like `pattern:[\s\S]`. That matches spaces or non-spaces -- any character. That's wider than a dot `"."`, because the dot matches any character except a newline.
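A minimal sketch of the difference between the dot and `pattern:[\s\S]`, using a made-up two-line string and the standard `regexp.test` method:

```js
let text = "A\nB"; // "A", a newline, then "B"

let dotResult = /A.B/.test(text);      // the dot does not match "\n"
let setResult = /A[\s\S]B/.test(text); // [\s\S] matches any character

console.log(dotResult); // false
console.log(setResult); // true
```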
## Excluding ranges
Besides normal ranges, there are "excluding" ranges that look like `pattern:[^…]`.
They are denoted by a caret character `^` at the start and match any character *except the given ones*.
For instance:
- `pattern:[^aeyo]` -- any character except `'a'`, `'e'`, `'y'` or `'o'`.
- `pattern:[^0-9]` -- any character except a digit, the same as `\D`.
- `pattern:[^\s]` -- any non-space character, same as `\S`.
The example below looks for any characters except letters, digits and spaces:
```js run
alert( "alice15@gmail.com".match(/[^\d\sA-Z]/gi) ); // @ and .
```
## No escaping in […]
Usually when we want to find exactly the dot character, we need to escape it like `pattern:\.`. And if we need a backslash, then we use `pattern:\\`.
In square brackets the vast majority of special characters can be used without escaping:
- A dot `pattern:'.'`.
- A plus `pattern:'+'`.
- Parentheses `pattern:'( )'`.
- Dash `pattern:'-'` in the beginning or the end (where it does not define a range).
- A caret `pattern:'^'` if not in the beginning (where it means exclusion).
- And the opening square bracket `pattern:'['`.
In other words, all special characters are allowed except where they mean something for square brackets.
A dot `"."` inside square brackets means just a dot. The pattern `pattern:[.,]` would look for one of characters: either a dot or a comma.
In the example below the regexp `pattern:[-().^+]` looks for one of the characters `-().^+`:
```js run
// No need to escape
let reg = /[-().^+]/g;
alert( "1 + 2 - 3".match(reg) ); // Matches +, -
```
...But if you decide to escape them "just in case", then there would be no harm:
```js run
// Escaped everything
let reg = /[\-\(\)\.\^\+]/g;
alert( "1 + 2 - 3".match(reg) ); // also works: +, -
```
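For contrast, here is a sketch of the characters that do need care inside square brackets: the closing bracket and the backslash are escaped, while the caret and the dash are placed where they carry no special meaning. The test string is made up for illustration.

```js
// "]" and "\" are escaped; "^" is not first and "-" is last, so both are literal
let reg = /[\]\\^-]/g;
let str = "a]b\\c^d-e"; // the string is: a]b\c^d-e

let found = str.match(reg);
console.log(found); // the four characters ], \, ^ and -
```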

View file

@ -0,0 +1,9 @@
Solution:
```js run
let reg = /\.{3,}/g;
alert( "Hello!... How goes?.....".match(reg) ); // ..., .....
```
Please note that the dot is a special character, so we have to escape it and insert as `\.`.

View file

@ -0,0 +1,14 @@
importance: 5
---
# How to find an ellipsis "..." ?
Create a regexp to find ellipsis: 3 (or more?) dots in a row.
Check it:
```js
let reg = /your regexp/g;
alert( "Hello!... How goes?.....".match(reg) ); // ..., .....
```

View file

@ -0,0 +1,31 @@
We need to look for `#` followed by 6 hexadecimal characters.
A hexadecimal character can be described as `pattern:[0-9a-fA-F]`. Or if we use the `i` flag, then just `pattern:[0-9a-f]`.
Then we can look for 6 of them using the quantifier `pattern:{6}`.
As a result, we have the regexp: `pattern:/#[a-f0-9]{6}/gi`.
```js run
let reg = /#[a-f0-9]{6}/gi;
let str = "color:#121212; background-color:#AA00ef bad-colors:f#fddee #fd2"
alert( str.match(reg) ); // #121212,#AA00ef
```
The problem is that it finds the color in longer sequences:
```js run
alert( "#12345678".match( /#[a-f0-9]{6}/gi ) ) // #123456
```
To fix that, we can add `pattern:\b` to the end:
```js run
// color
alert( "#123456".match( /#[a-f0-9]{6}\b/gi ) ); // #123456
// not a color
alert( "#12345678".match( /#[a-f0-9]{6}\b/gi ) ); // null
```

View file

@ -0,0 +1,15 @@
# Regexp for HTML colors
Create a regexp to search HTML-colors written as `#ABCDEF`: first `#` and then 6 hexadecimal characters.
An example of use:
```js
let reg = /...your regexp.../
let str = "color:#121212; background-color:#AA00ef bad-colors:f#fddee #fd2 #12345678";
alert( str.match(reg) ) // #121212,#AA00ef
```
P.S. In this task we do not need other color formats like `#123` or `rgb(1,2,3)` etc.

View file

@ -0,0 +1,140 @@
# Quantifiers +, *, ? and {n}
Let's say we have a string like `+7(903)-123-45-67` and want to find all numbers in it. But unlike before, we are interested not in single digits, but full numbers: `7, 903, 123, 45, 67`.
A number is a sequence of 1 or more digits `\d`. To mark how many we need, we need to append a *quantifier*.
## Quantity {n}
The simplest quantifier is a number in curly braces: `pattern:{n}`.
A quantifier is appended to a character (or a character class, or a `[...]` set etc) and specifies how many we need.
It has a few advanced forms, let's see examples:
The exact count: `{5}`
: `pattern:\d{5}` denotes exactly 5 digits, the same as `pattern:\d\d\d\d\d`.
The example below looks for a 5-digit number:
```js run
alert( "I'm 12345 years old".match(/\d{5}/) ); // "12345"
```
We can add `\b` to exclude longer numbers: `pattern:\b\d{5}\b`.
The range: `{3,5}`, match 3-5 times
: To find numbers from 3 to 5 digits we can put the limits into curly braces: `pattern:\d{3,5}`
```js run
alert( "I'm not 12, but 1234 years old".match(/\d{3,5}/) ); // "1234"
```
We can omit the upper limit.
Then a regexp `pattern:\d{3,}` looks for sequences of digits of length `3` or more:
```js run
alert( "I'm not 12, but 345678 years old".match(/\d{3,}/) ); // "345678"
```
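The effect of adding `pattern:\b` around an exact count can be seen in a short sketch (the test string is made up for illustration):

```js
let str = "12345 and 1234567";

// Without \b the first five digits of the longer number also match
let loose = str.match(/\d{5}/g);

// With \b on both sides only a standalone 5-digit number matches
let exact = str.match(/\b\d{5}\b/g);

console.log(loose); // ["12345", "12345"]
console.log(exact); // ["12345"]
```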
Let's return to the string `+7(903)-123-45-67`.
A number is a sequence of one or more digits in a row. So the regexp is `pattern:\d{1,}`:
```js run
let str = "+7(903)-123-45-67";
let numbers = str.match(/\d{1,}/g);
alert(numbers); // 7,903,123,45,67
```
## Shorthands
There are shorthands for most used quantifiers:
`+`
: Means "one or more", the same as `{1,}`.
For instance, `pattern:\d+` looks for numbers:
```js run
let str = "+7(903)-123-45-67";
alert( str.match(/\d+/g) ); // 7,903,123,45,67
```
`?`
: Means "zero or one", the same as `{0,1}`. In other words, it makes the symbol optional.
For instance, the pattern `pattern:ou?r` looks for `match:o` followed by zero or one `match:u`, and then `match:r`.
So, `pattern:colou?r` finds both `match:color` and `match:colour`:
```js run
let str = "Should I write color or colour?";
alert( str.match(/colou?r/g) ); // color, colour
```
`*`
: Means "zero or more", the same as `{0,}`. That is, the character may repeat any number of times or be absent.
For example, `pattern:\d0*` looks for a digit followed by any number of zeroes:
```js run
alert( "100 10 1".match(/\d0*/g) ); // 100, 10, 1
```
Compare it with `'+'` (one or more):
```js run
alert( "100 10 1".match(/\d0+/g) ); // 100, 10
// 1 not matched, as 0+ requires at least one zero
```
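To check that each shorthand really behaves like its curly-brace form, we can compare the results on the strings used above. The helper `sameMatches` is a hypothetical name introduced only for this sketch.

```js
// Hypothetical helper: do two regexps find the same matches in a string?
function sameMatches(str, a, b) {
  return String(str.match(a)) === String(str.match(b));
}

let plusSame = sameMatches("+7(903)-123-45-67", /\d+/g, /\d{1,}/g);
let questionSame = sameMatches("Should I write color or colour?", /colou?r/g, /colou{0,1}r/g);
let starSame = sameMatches("100 10 1", /\d0*/g, /\d0{0,}/g);

console.log(plusSame, questionSame, starSame); // true true true
```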
## More examples
Quantifiers are used very often. They serve as the main "building block" of complex regular expressions, so let's see more examples.
Regexp "decimal fraction" (a number with a floating point): `pattern:\d+\.\d+`
: In action:
```js run
alert( "0 1 12.345 7890".match(/\d+\.\d+/g) ); // 12.345
```
Regexp "open HTML-tag without attributes", like `<span>` or `<p>`: `pattern:/<[a-z]+>/i`
: In action:
```js run
alert( "<body> ... </body>".match(/<[a-z]+>/gi) ); // <body>
```
We look for character `pattern:'<'` followed by one or more English letters, and then `pattern:'>'`.
Regexp "open HTML-tag without attributes" (improved): `pattern:/<[a-z][a-z0-9]*>/i`
: Better regexp: according to the standard, HTML tag name may have a digit at any position except the first one, like `<h1>`.
```js run
alert( "<h1>Hi!</h1>".match(/<[a-z][a-z0-9]*>/gi) ); // <h1>
```
Regexp "opening or closing HTML-tag without attributes": `pattern:/<\/?[a-z][a-z0-9]*>/i`
: We added an optional slash `pattern:/?` before the tag. Had to escape it with a backslash, otherwise JavaScript would think it is the pattern end.
```js run
alert( "<h1>Hi!</h1>".match(/<\/?[a-z][a-z0-9]*>/gi) ); // <h1>, </h1>
```
```smart header="To make a regexp more precise, we often need to make it more complex"
We can see one common rule in these examples: the more precise the regular expression is, the longer and more complex it gets.
For instance, for HTML tags we could use a simpler regexp: `pattern:<\w+>`.
...But because `pattern:\w` means any English letter or a digit or `'_'`, the regexp also matches non-tags, for instance `match:<_>`. So it's much simpler than `pattern:<[a-z][a-z0-9]*>`, but less reliable.
Are we ok with `pattern:<\w+>` or do we need `pattern:<[a-z][a-z0-9]*>`?
In real life both variants are acceptable. Depends on how tolerant we can be to "extra" matches and whether it's difficult or not to filter them out by other means.
```
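A sketch of that trade-off on a made-up string: the simpler regexp returns extra matches, which can then be filtered out "by other means". The filtering rule here (the first character after `<` must be a letter) is just one possible assumption.

```js
let str = "<_> <h1> <1a> <body>";

let loose = str.match(/<\w+>/g);               // simple, but also finds non-tags
let strict = str.match(/<[a-z][a-z0-9]*>/gi);  // longer, but more precise

// Filtering the loose result afterwards: keep items whose name starts with a letter
let filtered = loose.filter(tag => /[a-z]/i.test(tag[1]));

console.log(loose);    // ["<_>", "<h1>", "<1a>", "<body>"]
console.log(strict);   // ["<h1>", "<body>"]
console.log(filtered); // ["<h1>", "<body>"]
```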

View file

@ -0,0 +1,6 @@
The result is: `match:123 4`.
First the lazy `pattern:\d+?` tries to take as few digits as it can, but it has to reach the space, so it takes `match:123`.
Then the second `\d+?` takes only one digit, because that's enough.
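The explanation can be checked by comparing the lazy pattern with its greedy and mixed variants (the variable names are only for illustration):

```js
let str = "123 456";

let greedy = str.match(/\d+ \d+/g);    // both parts take as many digits as possible
let lazy = str.match(/\d+? \d+?/g);    // the first part must still reach the space
let mixed = str.match(/\d+? \d+/g);    // lazy first part, greedy second part

console.log(greedy); // ["123 456"]
console.log(lazy);   // ["123 4"]
console.log(mixed);  // ["123 456"]
```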

View file

@ -0,0 +1,7 @@
# A match for /\d+? \d+?/
What's the match here?
```js
alert( "123 456".match(/\d+? \d+?/g) ); // ?
```

View file

@ -0,0 +1,17 @@
We need to find the beginning of the comment `match:<!--`, then everything till the end of `match:-->`.
The first idea could be `pattern:<!--.*?-->` -- the lazy quantifier makes the dot stop right before `match:-->`.
But a dot in JavaScript means "any symbol except the newline". So multiline comments won't be found.
We can use `pattern:[\s\S]` instead of the dot to match "anything":
```js run
let reg = /<!--[\s\S]*?-->/g;
let str = `... <!-- My -- comment
test --> .. <!----> ..
`;
alert( str.match(reg) ); // '<!-- My -- comment \n test -->', '<!---->'
```
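As an alternative sketch, engines that support the ES2018 `s` ("dotAll") flag let the dot match newlines too, so the first idea works there with that flag added. This is an assumption about the environment, not part of the solution above.

```js
let str = `... <!-- My -- comment
 test --> ..  <!----> ..
`;

let withSet = str.match(/<!--[\s\S]*?-->/g); // the solution above
let withFlag = str.match(/<!--.*?-->/gs);    // the dot with the s flag

console.log(withSet.length);                       // 2
console.log(String(withSet) === String(withFlag)); // true
```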

View file

@ -0,0 +1,13 @@
# Find HTML comments
Find all HTML comments in the text:
```js
let reg = /your regexp/g;
let str = `... <!-- My -- comment
test --> .. <!----> ..
`;
alert( str.match(reg) ); // '<!-- My -- comment \n test -->', '<!---->'
```

View file

@ -0,0 +1,10 @@
The solution is `pattern:<[^<>]+>`.
```js run
let reg = /<[^<>]+>/g;
let str = '<> <a href="/"> <input type="radio" checked> <b>';
alert( str.match(reg) ); // '<a href="/">', '<input type="radio" checked>', '<b>'
```

View file

@ -0,0 +1,15 @@
# Find HTML tags
Create a regular expression to find all (opening and closing) HTML tags with their attributes.
An example of use:
```js run
let reg = /your regexp/g;
let str = '<> <a href="/"> <input type="radio" checked> <b>';
alert( str.match(reg) ); // '<a href="/">', '<input type="radio" checked>', '<b>'
```
Let's assume that tags may not contain `<` and `>` inside (in quotes too); that simplifies things a bit.

View file

@ -0,0 +1,304 @@
# Greedy and lazy quantifiers
Quantifiers look very simple at first sight, but in fact they can be tricky.
We should understand how the search works very well if we plan to look for something more complex than `pattern:/\d+/`.
Let's take the following task as an example.
We have a text and need to replace all quotes `"..."` with guillemet marks: `«...»`. They are preferred for typography in many countries.
For instance: `"Hello, world"` should become `«Hello, world»`. Some countries prefer other quotes, like `„Witam, świat!”` (Polish) or `「你好,世界」` (Chinese), but for our task let's choose `«...»`.
The first thing to do is to locate quoted strings, and then we can replace them.
A regular expression like `pattern:/".+"/g` (a quote, then something, then the other quote) may seem like a good fit, but it isn't!
Let's try it:
```js run
let reg = /".+"/g;
let str = 'a "witch" and her "broom" is one';
alert( str.match(reg) ); // "witch" and her "broom"
```
...We can see that it doesn't work as intended!
Instead of finding two matches `match:"witch"` and `match:"broom"`, it finds one: `match:"witch" and her "broom"`.
That can be described as "greediness is the cause of all evil".
## Greedy search
To find a match, the regular expression engine uses the following algorithm:
- For every position in the string
    - Match the pattern at that position.
    - If there's no match, go to the next position.
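The loop above can be sketched in code. This is only an illustration of the idea, not how real engines are implemented: the helper `firstMatch` is a hypothetical name, and it relies on the standard sticky flag `y` (not covered yet), which makes a regexp match exactly at `lastIndex`.

```js
// Hypothetical helper: try the pattern at position 0, then 1, then 2, ...
function firstMatch(source, str) {
  let sticky = new RegExp(source, "y"); // "y": match exactly at lastIndex
  for (let pos = 0; pos <= str.length; pos++) {
    sticky.lastIndex = pos;
    let result = sticky.exec(str);
    if (result) return { index: pos, text: result[0] };
  }
  return null; // no position produced a match
}

let found = firstMatch('".+"', 'a "witch" and her "broom" is one');
console.log(found.index); // 2
console.log(found.text);  // "witch" and her "broom"
```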
These common words do not make it obvious why the regexp fails, so let's elaborate how the search works for the pattern `pattern:".+"`.
1. The first pattern character is a quote `pattern:"`.
The regular expression engine tries to find it at the zero position of the source string `subject:a "witch" and her "broom" is one`, but there's `subject:a` there, so there's immediately no match.
Then it advances: goes to the next positions in the source string and tries to find the first character of the pattern there, and finally finds the quote at the 3rd character (index 2):
![](witch_greedy1.png)
2. The quote is detected, and then the engine tries to find a match for the rest of the pattern. It tries to see if the rest of the subject string conforms to `pattern:.+"`.
In our case the next pattern character is `pattern:.` (a dot). It denotes "any character except a newline", so the next string letter `match:'w'` fits:
![](witch_greedy2.png)
3. Then the dot repeats because of the quantifier `pattern:.+`. The regular expression engine builds the match by taking characters one by one while it is possible.
...When does it become impossible? All characters match the dot, so it only stops when it reaches the end of the string:
![](witch_greedy3.png)
4. Now the engine finished repeating for `pattern:.+` and tries to find the next character of the pattern. It's the quote `pattern:"`. But there's a problem: the string has finished, there are no more characters!
The regular expression engine understands that it took too many `pattern:.+` and starts to *backtrack*.
In other words, it shortens the match for the quantifier by one character:
![](witch_greedy4.png)
Now it assumes that `pattern:.+` ends one character before the end and tries to match the rest of the pattern from that position.
If there were a quote there, then that would be the end, but the last character is `subject:'e'`, so there's no match.
5. ...So the engine decreases the number of repetitions of `pattern:.+` by one more character:
![](witch_greedy5.png)
The quote `pattern:'"'` does not match `subject:'n'`.
6. The engine keeps backtracking: it decreases the repetition count for `pattern:'.'` until the rest of the pattern (in our case `pattern:'"'`) matches:
![](witch_greedy6.png)
7. The match is complete.
8. So the first match is `match:"witch" and her "broom"`. The further search starts where the first match ends, but there are no more quotes in the rest of the string `subject:is one`, so no more results.
That's probably not what we expected, but that's how it works.
**In the greedy mode (by default) the quantifier is repeated as many times as possible.**
The regexp engine tries to fetch as many characters as it can by `pattern:.+`, and then shortens that one by one.
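The backtracking described above can be imitated by hand for this particular pattern. The sketch below is a simplified model under stated assumptions (a single quantified dot, no newlines in the string), not the engine's real code; all variable names are made up.

```js
let str = 'a "witch" and her "broom" is one';
let start = str.indexOf('"'); // the opening quote, index 2

// Greedy: the dot first takes everything up to the end of the string,
// so the closing quote is first tried at str.length (nothing is there)
let end = str.length;
let backtracks = 0;

// Backtrack: give back one character at a time until a quote is found,
// keeping at least one character for .+
while (end > start + 1 && str[end] !== '"') {
  end--;
  backtracks++;
}

let match = end > start + 1 ? str.slice(start, end + 1) : null;

console.log(match);      // "witch" and her "broom"
console.log(backtracks); // 8
```

The result is the same as `str.match(/".+"/)[0]`: the last quote in the string wins.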
For our task we want another thing. That's what the lazy quantifier mode is for.
## Lazy mode
The lazy mode of a quantifier is the opposite of the greedy mode. It means: "repeat the minimal number of times".
We can enable it by putting a question mark `pattern:'?'` after the quantifier, so that it becomes `pattern:*?` or `pattern:+?` or even `pattern:??` for `pattern:'?'`.
To make things clear: usually a question mark `pattern:?` is a quantifier by itself (zero or one), but if added *after another quantifier (or even itself)* it gets another meaning -- it switches the matching mode from greedy to lazy.
The regexp `pattern:/".+?"/g` works as intended: it finds `match:"witch"` and `match:"broom"`:
```js run
let reg = /".+?"/g;
let str = 'a "witch" and her "broom" is one';
alert( str.match(reg) ); // "witch", "broom"
```
To clearly understand the change, let's trace the search step by step.
1. The first step is the same: it finds the pattern start `pattern:'"'` at the 3rd character (index 2):
![](witch_greedy1.png)
2. The next step is also similar: the engine finds a match for the dot `pattern:'.'`:
![](witch_greedy2.png)
3. And now the search goes differently. Because we have a lazy mode for `pattern:+?`, the engine doesn't try to match a dot one more time, but stops and tries to match the rest of the pattern `pattern:'"'` right now:
![](witch_lazy3.png)
If there were a quote there, then the search would end, but there's `'i'`, so there's no match.
4. Then the regular expression engine increases the number of repetitions for the dot and tries one more time:
![](witch_lazy4.png)
Failure again. Then the number of repetitions is increased again and again...
5. ...Till the match for the rest of the pattern is found:
![](witch_lazy5.png)
6. The next search starts from the end of the current match and yields one more result:
![](witch_lazy6.png)
In this example we saw how the lazy mode works for `pattern:+?`. Quantifiers `pattern:*?` and `pattern:??` work in a similar way -- the regexp engine increases the number of repetitions only if the rest of the pattern can't match at the given position.
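The lazy steps can be imitated by hand as well. As before, this is a simplified model for this one pattern (a single lazy dot, no newlines in the string), with made-up variable names, not the engine's real code.

```js
let str = 'a "witch" and her "broom" is one';
let start = str.indexOf('"'); // the opening quote, index 2

// Lazy: .+? takes one character, then the closing quote is tried right away
let end = start + 2;
let extensions = 0;

// No quote at `end`? Let the dot take one more character and try again
while (end < str.length && str[end] !== '"') {
  end++;
  extensions++;
}

let match = end < str.length ? str.slice(start, end + 1) : null;

console.log(match);      // "witch"
console.log(extensions); // 4
```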
**Laziness is only enabled for the quantifier with `?`.**
Other quantifiers remain greedy.
For instance:
```js run
alert( "123 456".match(/\d+ \d+?/g) ); // 123 4
```
1. The pattern `pattern:\d+` tries to match as many digits as it can (greedy mode), so it finds `match:123` and stops, because the next character is a space `pattern:' '`.
2. Then there's a space in pattern, it matches.
3. Then there's `pattern:\d+?`. The quantifier is in lazy mode, so it finds one digit `match:4` and tries to check if the rest of the pattern matches from there.
...But there's nothing in the pattern after `pattern:\d+?`.
The lazy mode doesn't repeat anything without a need. The pattern finished, so we're done. We have a match `match:123 4`.
4. The next search starts from the character `5`.
```smart header="Optimizations"
Modern regular expression engines can optimize internal algorithms to work faster. So they may work a bit differently from the described algorithm.
But to understand how regular expressions work and to build regular expressions, we don't need to know about that. They are only used internally to optimize things.
Complex regular expressions are hard to optimize, so the search may work exactly as described as well.
```
## Alternative approach
With regexps, there's often more than one way to do the same thing.
In our case we can find quoted strings without lazy mode using the regexp `pattern:"[^"]+"`:
```js run
let reg = /"[^"]+"/g;
let str = 'a "witch" and her "broom" is one';
alert( str.match(reg) ); // "witch", "broom"
```
The regexp `pattern:"[^"]+"` gives correct results, because it looks for a quote `pattern:'"'` followed by one or more non-quotes `pattern:[^"]`, and then the closing quote.
When the regexp engine looks for `pattern:[^"]+` it stops the repetitions when it meets the closing quote, and we're done.
Please note, that this logic does not replace lazy quantifiers!
It is just different. There are times when we need one or another.
**Let's see an example where lazy quantifiers fail and this variant works right.**
For instance, we want to find links of the form `<a href="..." class="doc">`, with any `href`.
Which regular expression to use?
The first idea might be: `pattern:/<a href=".*" class="doc">/g`.
Let's check it:
```js run
let str = '...<a href="link" class="doc">...';
let reg = /<a href=".*" class="doc">/g;
// Works!
alert( str.match(reg) ); // <a href="link" class="doc">
```
It worked. But let's see what happens if there are many links in the text?
```js run
let str = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';
let reg = /<a href=".*" class="doc">/g;
// Whoops! Two links in one match!
alert( str.match(reg) ); // <a href="link1" class="doc">... <a href="link2" class="doc">
```
Now the result is wrong for the same reason as our "witches" example. The quantifier `pattern:.*` took too many characters.
The match looks like this:
```html
<a href="....................................." class="doc">
<a href="link1" class="doc">... <a href="link2" class="doc">
```
Let's modify the pattern by making the quantifier `pattern:.*?` lazy:
```js run
let str = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';
let reg = /<a href=".*?" class="doc">/g;
// Works!
alert( str.match(reg) ); // <a href="link1" class="doc">, <a href="link2" class="doc">
```
Now it seems to work, there are two matches:
```html
<a href="....." class="doc"> <a href="....." class="doc">
<a href="link1" class="doc">... <a href="link2" class="doc">
```
...But let's test it on one more text input:
```js run
let str = '...<a href="link1" class="wrong">... <p style="" class="doc">...';
let reg = /<a href=".*?" class="doc">/g;
// Wrong match!
alert( str.match(reg) ); // <a href="link1" class="wrong">... <p style="" class="doc">
```
Now it fails. The match includes not just a link, but also a lot of text after it, including `<p...>`.
Why?
That's what's going on:
1. First the regexp finds a link start `match:<a href="`.
2. Then it looks for `pattern:.*?`: takes one character (lazily!), checks if there's a match for `pattern:" class="doc">` (none).
3. Then takes another character into `pattern:.*?`, and so on... until it finally reaches `match:" class="doc">`.
But the problem is: that's already beyond the link, in another tag `<p>`. Not what we want.
Here's the picture of the match aligned with the text:
```html
<a href="..................................." class="doc">
<a href="link1" class="wrong">... <p style="" class="doc">
```
So the laziness did not work for us here.
We need the pattern to look for `<a href="...something..." class="doc">`, but both greedy and lazy variants have problems.
The correct variant would be: `pattern:href="[^"]*"`. It will take all characters inside the `href` attribute till the nearest quote, just what we need.
A working example:
```js run
let str1 = '...<a href="link1" class="wrong">... <p style="" class="doc">...';
let str2 = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';
let reg = /<a href="[^"]*" class="doc">/g;
// Works!
alert( str1.match(reg) ); // null, no matches, that's correct
alert( str2.match(reg) ); // <a href="link1" class="doc">, <a href="link2" class="doc">
```
## Summary
Quantifiers have two modes of work:
Greedy
: By default the regular expression engine tries to repeat the quantifier as many times as possible. For instance, `pattern:\d+` consumes all possible digits. When it becomes impossible to consume more (no more digits or string end), then it continues to match the rest of the pattern. If there's no match then it decreases the number of repetitions (backtracks) and tries again.
Lazy
: Enabled by the question mark `pattern:?` after the quantifier. The regexp engine tries to match the rest of the pattern before each repetition of the quantifier.
As we've seen, the lazy mode is not a "panacea" for the greedy search. An alternative is a "fine-tuned" greedy search, with exclusions. Soon we'll see more examples of it.

Binary file not shown.


View file

@ -0,0 +1,29 @@
A regexp to search 3-digit color `#abc`: `pattern:/#[a-f0-9]{3}/i`.
We can add exactly 3 more optional hex digits. We don't need more or less. Either we have them or we don't.
The simplest way to add them -- is to append to the regexp: `pattern:/#[a-f0-9]{3}([a-f0-9]{3})?/i`
We can do it in a smarter way though: `pattern:/#([a-f0-9]{3}){1,2}/i`.
Here the regexp `pattern:[a-f0-9]{3}` is in parentheses to apply the quantifier `pattern:{1,2}` to it as a whole.
In action:
```js run
let reg = /#([a-f0-9]{3}){1,2}/gi;
let str = "color: #3f3; background-color: #AA00ef; and: #abcd";
alert( str.match(reg) ); // #3f3 #AA00ef #abc
```
There's a minor problem here: the pattern found `match:#abc` in `subject:#abcd`. To prevent that we can add `pattern:\b` to the end:
```js run
let reg = /#([a-f0-9]{3}){1,2}\b/gi;
let str = "color: #3f3; background-color: #AA00ef; and: #abcd";
alert( str.match(reg) ); // #3f3 #AA00ef
```

View file

@ -0,0 +1,14 @@
# Find color in the format #abc or #abcdef
Write a RegExp that matches colors in the format `#abc` or `#abcdef`. That is: `#` followed by 3 or 6 hexadecimal digits.
Usage example:
```js
let reg = /your regexp/g;
let str = "color: #3f3; background-color: #AA00ef; and: #abcd";
alert( str.match(reg) ); // #3f3 #AA00ef
```
P.S. This should be exactly 3 or 6 hex digits: values like `#abcd` should not match.

View file

@ -0,0 +1,18 @@
A non-negative integer number is `pattern:\d+`. We should exclude `0` as the first digit, as we don't need zero, but we can allow it in further digits.
So that gives us `pattern:[1-9]\d*`.
A decimal part is: `pattern:\.\d+`.
Because the decimal part is optional, let's put it in parentheses with the quantifier `pattern:'?'`.
Finally we have the regexp: `pattern:[1-9]\d*(\.\d+)?`:
```js run
let reg = /[1-9]\d*(\.\d+)?/g;
let str = "1.5 0 -5 12. 123.4.";
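alert( str.match(reg) ); // 1.5, 5, 12, 123.4
```
Please note that `match:5` from `subject:-5` is found as well: the pattern does not look at the character before the number, so the minus sign is simply skipped. To filter such matches out, we'd need to check what precedes the number, which this regexp doesn't do.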

View file

@ -0,0 +1,12 @@
# Find positive numbers
Create a regexp that looks for positive numbers, including those without a decimal point.
An example of use:
```js
let reg = /your regexp/g;
let str = "1.5 0 -5 12. 123.4.";
alert( str.match(reg) ); // 1.5, 12, 123.4 (ignores 0 and -5)
```

View file

@ -0,0 +1,11 @@
A positive number with an optional decimal part is (per previous task): `pattern:\d+(\.\d+)?`.
Let's add an optional `-` in the beginning:
```js run
let reg = /-?\d+(\.\d+)?/g;
let str = "-1.5 0 2 -123.4.";
alert( str.match(reg) ); // -1.5, 0, 2, -123.4
```

View file

@ -0,0 +1,13 @@
# Find all numbers
Write a regexp that looks for all decimal numbers including integer ones, with the floating point and negative ones.
An example of use:
```js
let reg = /your regexp/g;
let str = "-1.5 0 2 -123.4.";
alert( str.match(reg) ); // -1.5, 0, 2, -123.4
```

View file

@ -0,0 +1,51 @@
A regexp for a number is: `pattern:-?\d+(\.\d+)?`. We created it in previous tasks.
An operator is `pattern:[-+*/]`. We put the dash `pattern:-` first, because in the middle it would mean a character range, we don't need that.
Note that a slash should be escaped inside a JavaScript regexp `pattern:/.../`.
We need a number, an operator, and then another number. And optional spaces between them.
The full regular expression: `pattern:-?\d+(\.\d+)?\s*[-+*/]\s*-?\d+(\.\d+)?`.
To get a result as an array let's put parentheses around the data that we need: numbers and the operator: `pattern:(-?\d+(\.\d+)?)\s*([-+*/])\s*(-?\d+(\.\d+)?)`.
In action:
```js run
let reg = /(-?\d+(\.\d+)?)\s*([-+*\/])\s*(-?\d+(\.\d+)?)/;
alert( "1.2 + 12".match(reg) );
```
The result includes:
- `result[0] == "1.2 + 12"` (full match)
- `result[1] == "1.2"` (first group `(-?\d+(\.\d+)?)` -- the first number, including the decimal part)
- `result[2] == ".2"` (second group `(\.\d+)?` -- the first decimal part)
- `result[3] == "+"` (third group `([-+*\/])` -- the operator)
- `result[4] == "12"` (fourth group `(-?\d+(\.\d+)?)` -- the second number)
- `result[5] == undefined` (fifth group `(\.\d+)?` -- the last decimal part is absent, so it's undefined)
We only want the numbers and the operator, without the full match or the decimal parts.
The full match (the array's first item) can be removed by shifting the array: `pattern:result.shift()`.
The decimal groups can be removed by making them into non-capturing groups, by adding `pattern:?:` to the beginning: `pattern:(?:\.\d+)?`.
The final solution:
```js run
function parse(expr) {
  let reg = /(-?\d+(?:\.\d+)?)\s*([-+*\/])\s*(-?\d+(?:\.\d+)?)/;
  let result = expr.match(reg);
  if (!result) return [];
  result.shift();
  return result;
}
alert( parse("-1.23 * 3.45") ); // -1.23, *, 3.45
```
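As a usage sketch, the parsed parts can be fed into a small calculator. The `calculate` function is a hypothetical addition, not part of the task; `parse` is repeated from the solution so that the block runs on its own.

```js
function parse(expr) {
  let reg = /(-?\d+(?:\.\d+)?)\s*([-+*\/])\s*(-?\d+(?:\.\d+)?)/;
  let result = expr.match(reg);
  if (!result) return [];
  result.shift();
  return result;
}

// Hypothetical helper built on top of parse()
function calculate(expr) {
  let [a, op, b] = parse(expr);
  if (op === undefined) return NaN; // nothing was parsed

  a = Number(a);
  b = Number(b);

  switch (op) {
    case "+": return a + b;
    case "-": return a - b;
    case "*": return a * b;
    default: return a / b; // "/" is the only remaining operator
  }
}

console.log( calculate("-3 / -6") );  // 0.5
console.log( calculate(" 2 - -2 ") ); // 4
```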

View file

@ -0,0 +1,28 @@
# Parse an expression
An arithmetical expression consists of 2 numbers and an operator between them, for instance:
- `1 + 2`
- `1.2 * 3.4`
- `-3 / -6`
- `-2 - 2`
The operator is one of: `"+"`, `"-"`, `"*"` or `"/"`.
There may be extra spaces at the beginning, at the end or between the parts.
Create a function `parse(expr)` that takes an expression and returns an array of 3 items:
1. The first number.
2. The operator.
3. The second number.
For example:
```js
let [a, op, b] = parse("1.2 * 3.4");
alert(a); // 1.2
alert(op); // *
alert(b); // 3.4
```

View file

@ -0,0 +1,237 @@
# Capturing groups
A part of a pattern can be enclosed in parentheses `pattern:(...)`. This is called a "capturing group".
That has two effects:
1. It allows us to get a part of the match as a separate item in the result array.
2. If we put a quantifier after the parentheses, it applies to the parentheses as a whole, not the last character.
## Example
In the example below the pattern `pattern:(go)+` finds one or more `match:'go'`:
```js run
alert( 'Gogogo now!'.match(/(go)+/i) ); // Gogogo,go (the full match, then the content of the group)
```
Without parentheses, the pattern `pattern:/go+/` means `subject:g`, followed by `subject:o` repeated one or more times. For instance, `match:goooo` or `match:gooooooooo`.
Parentheses group the word `pattern:(go)` together.
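To see the difference side by side, here is a small sketch that runs both patterns with the `pattern:g` flag. The sample string and the variable names are made up for this illustration:

```js
// Hypothetical sample text, chosen only for this illustration
let text = "gogogo goooo";

// g, followed by o repeated one or more times
let withoutGroup = text.match(/go+/g);

// the whole "go" repeated one or more times
let withGroup = text.match(/(go)+/g);

console.log(withoutGroup); // ["go", "go", "go", "goooo"]
console.log(withGroup);    // ["gogogo", "go"]
```

Without parentheses every `go` is found separately, while `pattern:(go)+` glues the repeated `go` into one match.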
Let's make something more complex -- a regexp to match an email.
Examples of emails:
```
my@mail.com
john.smith@site.com.uk
```
The pattern: `pattern:[-.\w]+@([\w-]+\.)+[\w-]{2,20}`.
1. The first part `pattern:[-.\w]+` (before `@`) may include any alphanumeric word characters, a dot and a dash, to match `match:john.smith`.
2. Then `pattern:@`, and the domain. It may be a subdomain like `host.site.com.uk`, so we match it as "a word followed by a dot" `pattern:([\w-]+\.)` (repeated), and then the last part must be a word: `match:com` or `match:uk` (but not very long: 2-20 characters).
That regexp is not perfect, but good enough to catch errors or occasional mistypes.
For instance, we can find all emails in the string:
```js run
let reg = /[-.\w]+@([\w-]+\.)+[\w-]{2,20}/g;
alert("my@mail.com @ his@site.com.uk".match(reg)); // my@mail.com, his@site.com.uk
```
In this example parentheses were used to make a group for repeating `pattern:(...)+`. But there are other uses too, let's see them.
## Contents of parentheses
Parentheses are numbered from left to right. The search engine remembers the content matched by each of them and allows us to reference it in the pattern or in the replacement string.
For instance, we'd like to find HTML tags `pattern:<.*?>`, and process them.
Let's wrap the inner content into parentheses, like this: `pattern:<(.*?)>`.
We'll get them into an array:
```js run
let str = '<h1>Hello, world!</h1>';
let reg = /<(.*?)>/;
alert( str.match(reg) ); // Array: ["<h1>", "h1"]
```
The call to [String#match](mdn:js/String/match) returns groups only if the regexp has no `pattern:/.../g` flag.
If we need all matches with their groups then we can use `.matchAll` or `regexp.exec` as described in <info:regexp-methods>:
```js run
let str = '<h1>Hello, world!</h1>';
// two matches: opening <h1> and closing </h1> tags
let reg = /<(.*?)>/g;
let matches = Array.from( str.matchAll(reg) );
alert(matches[0]); // Array: ["<h1>", "h1"]
alert(matches[1]); // Array: ["</h1>", "/h1"]
```
Here we have two matches for `pattern:<(.*?)>`, each of them is an array with the full match and groups.
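The result of `matchAll` can also be looped over directly, without `Array.from`. A minimal sketch (the variable names `html` and `inner` are just for illustration) that collects only the contents of the parentheses:

```js
let html = '<h1>Hello, world!</h1>';

let inner = [];
for (let match of html.matchAll(/<(.*?)>/g)) {
  // match[0] is the full match, match[1] is the content of the parentheses
  inner.push(match[1]);
}

console.log(inner); // ["h1", "/h1"]
```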
## Nested groups
Parentheses can be nested. In this case the numbering also goes from left to right.
For instance, when searching a tag in `subject:<span class="my">` we may be interested in:
1. The tag content as a whole: `match:span class="my"`.
2. The tag name: `match:span`.
3. The tag attributes: `match:class="my"`.
Let's add parentheses for them:
```js run
let str = '<span class="my">';
let reg = /<(([a-z]+)\s*([^>]*))>/;
let result = str.match(reg);
alert(result); // <span class="my">, span class="my", span, class="my"
```
Here's how groups look:
![](regexp-nested-groups.png)
At the zero index of the `result` is always the full match.
Then groups, numbered from left to right. Whichever opens first gives the first group `result[1]`. Here it encloses the whole tag content.
Then in `result[2]` goes the group from the second opening `pattern:(` till the corresponding `pattern:)` -- tag name, then we don't group spaces, but group attributes for `result[3]`.
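As the order of groups is fixed, the result can be unpacked with array destructuring. This is only a sketch, the variable names are our own choice:

```js
let tag = '<span class="my">';

// the full match first, then the groups in the order of their opening parentheses
let [fullMatch, content, tagName, attributes] = tag.match(/<(([a-z]+)\s*([^>]*))>/);

console.log(fullMatch);  // <span class="my">
console.log(content);    // span class="my"
console.log(tagName);    // span
console.log(attributes); // class="my"
```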
**If a group is optional and doesn't exist in the match, the corresponding `result` index is present (and equals `undefined`).**
For instance, let's consider the regexp `pattern:a(z)?(c)?`. It looks for `"a"` optionally followed by `"z"` optionally followed by `"c"`.
If we run it on the string with a single letter `subject:a`, then the result is:
```js run
let match = 'a'.match(/a(z)?(c)?/);
alert( match.length ); // 3
alert( match[0] ); // a (whole match)
alert( match[1] ); // undefined
alert( match[2] ); // undefined
```
The array has the length of `3`, but all groups are empty.
And here's a more complex match for the string `subject:ack`:
```js run
let match = 'ack'.match(/a(z)?(c)?/)
alert( match.length ); // 3
alert( match[0] ); // ac (whole match)
alert( match[1] ); // undefined, because there's nothing for (z)?
alert( match[2] ); // c
```
The array length stays the same: `3`. But there's nothing for the group `pattern:(z)?`, so the result is `["ac", undefined, "c"]`.
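Because a missing optional group is `undefined`, destructuring defaults can fill it in. A small sketch, where the placeholder strings `"(no z)"` and `"(no c)"` are made up for the example:

```js
// defaults are applied only to items that are undefined
let [whole, z = "(no z)", c = "(no c)"] = 'ack'.match(/a(z)?(c)?/);

console.log(whole); // ac
console.log(z);     // (no z)
console.log(c);     // c
```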
## Named groups
Remembering groups by their numbers is hard. For simple patterns it's doable, but for more complex ones we can give names to parentheses.
That's done by putting `pattern:?<name>` immediately after the opening paren, like this:
```js run
*!*
let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;
*/!*
let str = "2019-04-30";
let groups = str.match(dateRegexp).groups;
alert(groups.year); // 2019
alert(groups.month); // 04
alert(groups.day); // 30
```
As you can see, the groups reside in the `.groups` property of the match.
We can also use them in replacements, as `pattern:$<name>` (like `$1..9`, but name instead of a digit).
For instance, let's rearrange the date into `day.month.year`:
```js run
let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;
let str = "2019-04-30";
let rearranged = str.replace(dateRegexp, '$<day>.$<month>.$<year>');
alert(rearranged); // 30.04.2019
```
If we use a function for the replacement, then the named `groups` object is always the last argument:
```js run
let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;
let str = "2019-04-30";
let rearranged = str.replace(dateRegexp,
(str, year, month, day, offset, input, groups) =>
`${groups.day}.${groups.month}.${groups.year}`
);
alert(rearranged); // 30.04.2019
```
Usually, when we intend to use named groups, we don't need positional arguments of the function. For the majority of real-life cases we only need `str` and `groups`.
So we can write it a little bit shorter:
```js
let rearranged = str.replace(dateRegexp, (str, ...args) => {
let {year, month, day} = args.pop();
alert(str); // 2019-04-30
alert(year); // 2019
alert(month); // 04
alert(day); // 30
});
```
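The callback above only shows the values. To actually replace something, the function has to return the replacement. A minimal sketch of that, assuming a sample text with two dates (made up for the illustration) and the `pattern:g` flag to process both:

```js
let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/g;

// hypothetical sample text with two dates
let text = "2019-04-30 and 2020-01-15";

let rearranged = text.replace(dateRegexp, (match, ...args) => {
  // the groups object is the last argument
  let {year, month, day} = args.pop();
  return `${day}.${month}.${year}`;
});

console.log(rearranged); // 30.04.2019 and 15.01.2020
```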
## Non-capturing groups with ?:
Sometimes we need parentheses to correctly apply a quantifier, but we don't want the contents in results.
A group may be excluded by adding `pattern:?:` in the beginning.
For instance, if we want to find `pattern:(go)+`, but don't want to remember the contents (`go`) in a separate array item, we can write: `pattern:(?:go)+`.
In the example below we only get the name "John" as a separate member of the `result` array:
```js run
let str = "Gogo John!";
*!*
// exclude Gogo from capturing
let reg = /(?:go)+ (\w+)/i;
*/!*
let result = str.match(reg);
alert( result.length ); // 2
alert( result[1] ); // John
```
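For comparison, here is a sketch that runs the same search with a capturing and with a non-capturing group (the variable names are ours):

```js
let str = "Gogo John!";

let capturing = str.match(/(go)+ (\w+)/i);
let nonCapturing = str.match(/(?:go)+ (\w+)/i);

console.log(capturing.length);    // 3
console.log(capturing[1]);        // go (the last repetition of the group)
console.log(capturing[2]);        // John

console.log(nonCapturing.length); // 2
console.log(nonCapturing[1]);     // John
```

With `pattern:?:` the name moves from index `2` to index `1`, because the first parentheses no longer take a place in the result.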
## Summary
- Parentheses can be:
- capturing `(...)`, ordered left-to-right, accessible by number.
- named capturing `(?<name>...)`, accessible by name.
- non-capturing `(?:...)`, used only to apply a quantifier to the whole group.

View file

@ -0,0 +1,65 @@
# Backreferences in pattern: \n and \k
Capturing groups can be accessed not only in the result or in the replacement string, but also in the pattern itself.
## Backreference by number: \n
A group can be referenced in the pattern using `\n`, where `n` is the group number.
To make things clear let's consider a task.
We need to find a quoted string: either a single-quoted `subject:'...'` or a double-quoted `subject:"..."` -- both variants need to match.
How to look for them?
We can put two kinds of quotes in the pattern: `pattern:['"](.*?)['"]`, but it would find strings with mixed quotes, like `match:"...'` and `match:'..."`. That would lead to incorrect matches when one quote appears inside other ones, like the string `subject:"She's the one!"`:
```js run
let str = `He said: "She's the one!".`;
let reg = /['"](.*?)['"]/g;
// The result is not what we expect
alert( str.match(reg) ); // "She'
```
As we can see, the pattern found an opening quote `match:"`, then the text is consumed lazily till the other quote `match:'`, that closes the match.
To make sure that the pattern looks for the closing quote exactly the same as the opening one, we can wrap it into a capturing group and use the backreference.
Here's the correct code:
```js run
let str = `He said: "She's the one!".`;
*!*
let reg = /(['"])(.*?)\1/g;
*/!*
alert( str.match(reg) ); // "She's the one!"
```
Now it works! The regular expression engine finds the first quote `pattern:(['"])` and remembers the content of `pattern:(...)`, that's the first capturing group.
Further in the pattern `pattern:\1` means "find the same text as in the first group", exactly the same quote in our case.
Please note:
- To reference a group inside a replacement string -- we use `$1`, while in the pattern -- a backslash `\1`.
- If we use `?:` in the group, then we can't reference it. Groups that are excluded from capturing `(?:...)` are not remembered by the engine.
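Both forms can appear in one call. As a sketch, the code below removes accidentally doubled words: `pattern:\1` inside the pattern requires the same word again, and `$1` in the replacement string puts it back once. The sample text is made up for the illustration:

```js
// hypothetical text with two doubled words
let text = "this is is a test test";

// \1 in the pattern: the same word once more
// $1 in the replacement: the remembered word
let fixed = text.replace(/\b(\w+) \1\b/g, "$1");

console.log(fixed); // this is a test
```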
## Backreference by name: `\k<name>`
For named groups, we can backreference by `\k<name>`.
The same example with the named group:
```js run
let str = `He said: "She's the one!".`;
*!*
let reg = /(?<quote>['"])(.*?)\k<quote>/g;
*/!*
alert( str.match(reg) ); // "She's the one!"
```

View file

@ -0,0 +1,33 @@
The first idea can be to list the languages with `|` in-between.
But that doesn't work right:
```js run
let reg = /Java|JavaScript|PHP|C|C\+\+/g;
let str = "Java, JavaScript, PHP, C, C++";
alert( str.match(reg) ); // Java,Java,PHP,C,C
```
The regular expression engine looks for alternations one-by-one. That is: first it checks if we have `match:Java`, otherwise -- looks for `match:JavaScript` and so on.
As a result, `match:JavaScript` can never be found, just because `match:Java` is checked first.
The same with `match:C` and `match:C++`.
There are two solutions for that problem:
1. Change the order to check the longer match first: `pattern:JavaScript|Java|C\+\+|C|PHP`.
2. Merge variants with the same start: `pattern:Java(Script)?|C(\+\+)?|PHP`.
In action:
```js run
let reg = /Java(Script)?|C(\+\+)?|PHP/g;
let str = "Java, JavaScript, PHP, C, C++";
alert( str.match(reg) ); // Java,JavaScript,PHP,C,C++
```
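The first solution, with the longer variants checked first, can be verified the same way. A short sketch (the variable names differ from the ones above only to keep the example standalone):

```js
let reordered = /JavaScript|Java|C\+\+|C|PHP/g;

let languages = "Java, JavaScript, PHP, C, C++";

let found = languages.match(reordered);

console.log(found); // ["Java", "JavaScript", "PHP", "C", "C++"]
```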

View file

@ -0,0 +1,11 @@
# Find programming languages
There are many programming languages, for instance Java, JavaScript, PHP, C, C++.
Create a regexp that finds them in the string `subject:Java JavaScript PHP C++ C`:
```js
let reg = /your regexp/g;
alert("Java JavaScript PHP C++ C".match(reg)); // Java,JavaScript,PHP,C++,C
```

View file

@ -0,0 +1,23 @@
Opening tag is `pattern:\[(b|url|quote)\]`.
Then to find everything till the closing tag -- let's use the pattern `pattern:[\s\S]*?` to match any character including the newline, and then a backreference for the closing tag.
The full pattern: `pattern:\[(b|url|quote)\][\s\S]*?\[/\1\]`.
In action:
```js run
let reg = /\[(b|url|quote)\][\s\S]*?\[\/\1\]/g;
let str = `
[b]hello![/b]
[quote]
[url]http://google.com[/url]
[/quote]
`;
alert( str.match(reg) ); // [b]hello![/b],[quote][url]http://google.com[/url][/quote]
```
Please note that we had to escape a slash for the closing tag `pattern:[/\1]`, because normally the slash closes the pattern.
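The slash needs escaping only inside the `/.../` literal. As a sketch, the same pattern created with `new RegExp` from a string keeps the slash as is, but then every backslash must be doubled, because the string itself consumes one level of escaping:

```js
// the same pattern as a string: "/" is not escaped, backslashes are doubled
let reg = new RegExp("\\[(b|url|quote)\\][\\s\\S]*?\\[/\\1\\]", "g");

let str = "..[url][b]http://google.com[/b][/url]..";

let tags = str.match(reg);

console.log(tags); // ["[url][b]http://google.com[/b][/url]"]
```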

View file

@ -0,0 +1,48 @@
# Find bbtag pairs
A "bb-tag" looks like `[tag]...[/tag]`, where `tag` is one of: `b`, `url` or `quote`.
For instance:
```
[b]text[/b]
[url]http://google.com[/url]
```
BB-tags can be nested. But a tag can't be nested into itself, for instance:
```
Normal:
[url] [b]http://google.com[/b] [/url]
[quote] [b]text[/b] [/quote]
Impossible:
[b][b]text[/b][/b]
```
Tags can contain line breaks, that's normal:
```
[quote]
[b]text[/b]
[/quote]
```
Create a regexp to find all BB-tags with their contents.
For instance:
```js
let reg = /your regexp/g;
let str = "..[url]http://google.com[/url]..";
alert( str.match(reg) ); // [url]http://google.com[/url]
```
If tags are nested, then we need the outer tag (if we want we can continue the search in its content):
```js
let reg = /your regexp/g;
let str = "..[url][b]http://google.com[/b][/url]..";
alert( str.match(reg) ); // [url][b]http://google.com[/b][/url]
```

View file

@ -0,0 +1,17 @@
The solution: `pattern:/"(\\.|[^"\\])*"/g`.
Step by step:
- First we look for an opening quote `pattern:"`
- Then if we have a backslash `pattern:\\` (we technically have to double it in the pattern, because it is a special character, so that's a single backslash in fact), then any character is fine after it (a dot).
- Otherwise we take any character except a quote (that would mean the end of the string) and a backslash (to prevent lonely backslashes, the backslash is only used with some other symbol after it): `pattern:[^"\\]`
- ...And so on till the closing quote.
In action:
```js run
let reg = /"(\\.|[^"\\])*"/g;
let str = ' .. "test me" .. "Say \\"Hello\\"!" .. "\\\\ \\"" .. ';
alert( str.match(reg) ); // "test me","Say \"Hello\"!","\\ \""
```

View file

@ -0,0 +1,32 @@
# Find quoted strings
Create a regexp to find strings in double quotes `subject:"..."`.
The important part is that strings should support escaping, in the same way as JavaScript strings do. For instance, quotes can be inserted as `subject:\"`, a newline as `subject:\n`, and the backslash itself as `subject:\\`.
```js
let str = "Just like \"here\".";
```
For us it's important that an escaped quote `subject:\"` does not end a string.
So we should look from one quote to the other ignoring escaped quotes on the way.
That's the essential part of the task, otherwise it would be trivial.
Examples of strings to match:
```js
.. *!*"test me"*/!* ..
.. *!*"Say \"Hello\"!"*/!* ... (escaped quotes inside)
.. *!*"\\"*/!* .. (double backslash inside)
.. *!*"\\ \""*/!* .. (double backslash and an escaped quote inside)
```
In JavaScript we need to double the backslashes to pass them right into the string, like this:
```js run
let str = ' .. "test me" .. "Say \\"Hello\\"!" .. "\\\\ \\"" .. ';
// the in-memory string
alert(str); // .. "test me" .. "Say \"Hello\"!" .. "\\ \"" ..
```

View file

@ -0,0 +1,16 @@
The pattern start is obvious: `pattern:<style`.
...But then we can't simply write `pattern:<style.*?>`, because `match:<styler>` would match it.
We need either a space after `match:<style` and then optionally something else or the ending `match:>`.
In the regexp language: `pattern:<style(>|\s.*?>)`.
In action:
```js run
let reg = /<style(>|\s.*?>)/g;
alert( '<style> <styler> <style test="...">'.match(reg) ); // <style>, <style test="...">
```

View file

@ -0,0 +1,13 @@
# Find the full tag
Write a regexp to find the tag `<style...>`. It should match the full tag: it may have no attributes `<style>` or have several of them `<style type="..." id="...">`.
...But the regexp should not match `<styler>`!
For instance:
```js
let reg = /your regexp/g;
alert( '<style> <styler> <style test="...">'.match(reg) ); // <style>, <style test="...">
```

View file

@ -0,0 +1,59 @@
# Alternation (OR) |
Alternation is the term in regular expressions for what is actually a simple "OR".
In a regular expression it is denoted with a vertical line character `pattern:|`.
For instance, we need to find programming languages: HTML, PHP, Java or JavaScript.
The corresponding regexp: `pattern:html|php|java(script)?`.
A usage example:
```js run
let reg = /html|php|css|java(script)?/gi;
let str = "First HTML appeared, then CSS, then JavaScript";
alert( str.match(reg) ); // 'HTML', 'CSS', 'JavaScript'
```
We already know a similar thing -- square brackets. They allow us to choose between multiple characters, for instance `pattern:gr[ae]y` matches `match:gray` or `match:grey`.
Square brackets allow only characters or character sets. Alternation allows any expressions. A regexp `pattern:A|B|C` means one of expressions `A`, `B` or `C`.
For instance:
- `pattern:gr(a|e)y` means exactly the same as `pattern:gr[ae]y`.
- `pattern:gra|ey` means `match:gra` or `match:ey`.
To separate a part of the pattern for alternation we usually enclose it in parentheses, like this: `pattern:before(XXX|YYY)after`.
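The difference between the two variants is easy to check. A small sketch with a made-up test string:

```js
// hypothetical test string
let words = "gray grey gra ey";

// alternation limited by parentheses: gr, then a or e, then y
let grouped = words.match(/gr(a|e)y/g);

// alternation over the whole pattern: either "gra" or "ey"
let ungrouped = words.match(/gra|ey/g);

console.log(grouped);   // ["gray", "grey"]
console.log(ungrouped); // ["gra", "ey", "gra", "ey"]
```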
## Regexp for time
In previous chapters there was a task to build a regexp for searching time in the form `hh:mm`, for instance `12:00`. But a simple `pattern:\d\d:\d\d` is too vague. It accepts `25:99` as the time (as 99 minutes match the pattern, but that time is invalid).
How can we make a better one?
We can apply more careful matching. First, the hours:
- If the first digit is `0` or `1`, then the next digit can be anything.
- Or, if the first digit is `2`, then the next must be `pattern:[0-3]`.
As a regexp: `pattern:[01]\d|2[0-3]`.
Next, the minutes must be from `0` to `59`. In the regexp language that means `pattern:[0-5]\d`: the first digit `0-5`, and then any digit.
Let's glue them together into the pattern: `pattern:[01]\d|2[0-3]:[0-5]\d`.
We're almost done, but there's a problem. The alternation `pattern:|` now happens to be between `pattern:[01]\d` and `pattern:2[0-3]:[0-5]\d`.
That's wrong, as it should be applied only to hours `[01]\d` OR `2[0-3]`. That's a common mistake when starting to work with regular expressions.
The correct variant:
```js run
let reg = /([01]\d|2[0-3]):[0-5]\d/g;
alert("00:00 10:10 23:59 25:99 1:2".match(reg)); // 00:00,10:10,23:59
```
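To see why the parentheses matter, here is a sketch that runs the glued pattern without them on the same string (the variable names are ours):

```js
let times = "00:00 10:10 23:59 25:99 1:2";

// without parentheses: either [01]\d alone, or 2[0-3]:[0-5]\d
let wrong = times.match(/[01]\d|2[0-3]:[0-5]\d/g);

// with parentheses: the alternation covers only the hours
let right = times.match(/([01]\d|2[0-3]):[0-5]\d/g);

console.log(wrong); // ["00", "00", "10", "10", "23:59"]
console.log(right); // ["00:00", "10:10", "23:59"]
```

The first pattern finds bare pairs of digits, because its left alternative `pattern:[01]\d` says nothing about minutes.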

View file

@ -0,0 +1,6 @@
The empty string is the only match: it starts and immediately finishes.
The task once again demonstrates that anchors are not characters, but tests.
The string is empty `""`. The engine first matches the `pattern:^` (input start), yes it's there, and then immediately the end `pattern:$`, it's here too. So there's a match.
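A quick check of that reasoning, as a minimal sketch:

```js
let emptyResult = /^$/.test("");  // start and end are at the same position
let spaceResult = /^$/.test(" "); // a space stands between start and end
let letterResult = /^$/.test("a");

console.log(emptyResult);  // true
console.log(spaceResult);  // false
console.log(letterResult); // false
```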

View file

@ -0,0 +1,3 @@
# Regexp ^$
Which string matches the pattern `pattern:^$`?

View file

@ -0,0 +1,21 @@
A two-digit hex number is `pattern:[0-9a-f]{2}` (assuming the `pattern:i` flag is enabled).
We need that number `NN`, and then `:NN` repeated 5 times (more numbers).
The regexp is: `pattern:[0-9a-f]{2}(:[0-9a-f]{2}){5}`
Now let's show that the match should capture all the text: start at the beginning and end at the end. That's done by wrapping the pattern in `pattern:^...$`.
Finally:
```js run
let reg = /^[0-9a-f]{2}(:[0-9a-f]{2}){5}$/i;
alert( reg.test('01:32:54:67:89:AB') ); // true
alert( reg.test('0132546789AB') ); // false (no colons)
alert( reg.test('01:32:54:67:89') ); // false (5 numbers, need 6)
alert( reg.test('01:32:54:67:89:ZZ') ); // false (ZZ at the end)
```

View file

@ -0,0 +1,20 @@
# Check MAC-address
[MAC-address](https://en.wikipedia.org/wiki/MAC_address) of a network interface consists of 6 two-digit hex numbers separated by colons.
For instance: `subject:'01:32:54:67:89:AB'`.
Write a regexp that checks whether a string is a MAC-address.
Usage:
```js
let reg = /your regexp/;
alert( reg.test('01:32:54:67:89:AB') ); // true
alert( reg.test('0132546789AB') ); // false (no colons)
alert( reg.test('01:32:54:67:89') ); // false (5 numbers, must be 6)
alert( reg.test('01:32:54:67:89:ZZ') ); // false (ZZ at the end)
```

View file

@ -0,0 +1,55 @@
# String start ^ and finish $
The caret `pattern:'^'` and dollar `pattern:'$'` characters have special meaning in a regexp. They are called "anchors".
The caret `pattern:^` matches at the beginning of the text, and the dollar `pattern:$` -- at the end.
For instance, let's test if the text starts with `Mary`:
```js run
let str1 = "Mary had a little lamb, its fleece was white as snow";
let str2 = 'Everywhere Mary went, the lamb was sure to go';
alert( /^Mary/.test(str1) ); // true
alert( /^Mary/.test(str2) ); // false
```
The pattern `pattern:^Mary` means: "the string start and then Mary".
Now let's test whether the text ends with an email.
To match an email, we can use a regexp `pattern:[-.\w]+@([\w-]+\.)+[\w-]{2,20}`.
To test whether the string ends with the email, let's add `pattern:$` to the pattern:
```js run
let reg = /[-.\w]+@([\w-]+\.)+[\w-]{2,20}$/;
let str1 = 'My email is mail@site.com';
let str2 = 'Everywhere Mary went, the lamb was sure to go';
alert( reg.test(str1) ); // true
alert( reg.test(str2) ); // false
```
We can use both anchors together to check whether the string exactly follows the pattern. That's often used for validation.
For instance we want to check that `str` is exactly a color in the form `#` plus 6 hex digits. The pattern for the color is `pattern:#[0-9a-f]{6}`.
To check that the *whole string* exactly matches it, we add `pattern:^...$`:
```js run
let str = "#abcdef";
alert( /^#[0-9a-f]{6}$/i.test(str) ); // true
```
The regexp engine looks for the text start, then the color, and then immediately the text end. Just what we need.
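For validation such a check is usually wrapped into a function. A minimal sketch, where `isHexColor` is a hypothetical helper name and the test values are made up:

```js
// hypothetical helper: true only if the whole string is # plus 6 hex digits
function isHexColor(str) {
  return /^#[0-9a-f]{6}$/i.test(str);
}

console.log( isHexColor("#abcdef") );        // true
console.log( isHexColor("#ABCDEF") );        // true (flag i)
console.log( isHexColor("#abcdefg") );       // false (extra character before the end)
console.log( isHexColor("color: #abcdef") ); // false (text before the color)
console.log( isHexColor("#abc") );           // false (too short)
```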
```smart header="Anchors have zero length"
Anchors, just like `\b`, are tests. They have zero width.
In other words, they do not match a character, but rather force the regexp engine to check the condition (text start/end).
```
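The zero width can be observed directly. In the sketch below (variable names are ours) a match for `pattern:^` is an empty string, and replacing an anchor inserts text without removing any character:

```js
let startMatch = "abc".match(/^/);

console.log(startMatch[0].length); // 0 (nothing is consumed)
console.log(startMatch.index);     // 0 (the position of the string start)

// "replacing" an anchor only inserts text at that position
let wrapped = "abc".replace(/^/, "[").replace(/$/, "]");

console.log(wrapped); // [abc]
```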
The behavior of anchors changes if there's a flag `pattern:m` (multiline mode). We'll explore it in the next chapter.

View file

@ -0,0 +1,76 @@
# Multiline mode, flag "m"
The multiline mode is enabled by the flag `pattern:/.../m`.
It only affects the behavior of `pattern:^` and `pattern:$`.
In the multiline mode they match not only at the beginning and end of the string, but also at start/end of line.
## Line start ^
In the example below the text has multiple lines. The pattern `pattern:/^\d+/gm` takes a number from the beginning of each one:
```js run
let str = `1st place: Winnie
2nd place: Piglet
33rd place: Eeyore`;
*!*
alert( str.match(/^\d+/gm) ); // 1, 2, 33
*/!*
```
Without the flag `pattern:/.../m` only the first number is matched:
```js run
let str = `1st place: Winnie
2nd place: Piglet
33rd place: Eeyore`;
*!*
alert( str.match(/^\d+/g) ); // 1
*/!*
```
That's because by default the caret `pattern:^` only matches at the beginning of the text, while in the multiline mode it also matches at the start of every line.
The regular expression engine moves along the text and looks for a line start `pattern:^`; when it finds one, it continues to match the rest of the pattern `pattern:\d+`.
## Line end $
The dollar sign `pattern:$` behaves similarly.
The regular expression `pattern:\w+$` finds the last word in every line:
```js run
let str = `1st place: Winnie
2nd place: Piglet
33rd place: Eeyore`;
alert( str.match(/\w+$/gim) ); // Winnie,Piglet,Eeyore
```
Without the `pattern:/.../m` flag the dollar `pattern:$` would only match the end of the whole string, so only the very last word would be found.
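A sketch that compares both variants on the same text (the variable names are ours):

```js
let str = `1st place: Winnie
2nd place: Piglet
33rd place: Eeyore`;

let withoutM = str.match(/\w+$/gi);
let withM = str.match(/\w+$/gim);

console.log(withoutM); // ["Eeyore"]
console.log(withM);    // ["Winnie", "Piglet", "Eeyore"]
```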
## Anchors ^$ versus \n
To find a newline, we can use not only `pattern:^` and `pattern:$`, but also the newline character `\n`.
The first difference is that unlike anchors, the character `\n` "consumes" the newline character and adds it to the result.
For instance, here we use it instead of `pattern:$`:
```js run
let str = `1st place: Winnie
2nd place: Piglet
33rd place: Eeyore`;
alert( str.match(/\w+\n/gim) ); // Winnie\n,Piglet\n
```
Here every match is a word plus a newline character.
And one more difference -- the newline `\n` does not match at the string end. That's why `Eeyore` is not found in the example above.
So, anchors are usually better, they are closer to what we want to get.

View file

@ -0,0 +1,105 @@
# Lookahead and lookbehind
Sometimes we need to match a pattern only if followed by another pattern. For instance, we'd like to get the price from a string like `subject:1 turkey costs 30€`.
We need a number (let's say a price has no decimal point) followed by `subject:€` sign.
That's what lookahead is for.
## Lookahead
The syntax is: `pattern:x(?=y)`, it means "look for `pattern:x`, but match only if followed by `pattern:y`".
For an integer amount followed by `subject:€`, the regexp will be `pattern:\d+(?=€)`:
```js run
let str = "1 turkey costs 30€";
alert( str.match(/\d+(?=€)/) ); // 30 (correctly skipped the sole number 1)
```
Let's say we want a quantity instead, that is a number, NOT followed by `subject:€`.
Here a negative lookahead can be applied.
The syntax is: `pattern:x(?!y)`, it means "search `pattern:x`, but only if not followed by `pattern:y`".
```js run
let str = "2 turkeys cost 60€";
alert( str.match(/\d+(?!€)/) ); // 2 (correctly skipped the price)
```
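One caveat is worth a check. If we search for all matches, the negative lookahead alone can cut a number in the middle: the engine backtracks from `60` to `6`, and `6` is followed by `0`, not by `subject:€`. The sketch below shows that, together with one possible way around it, which forbids a digit after the match as well. The second pattern is our own variation, not the only option:

```js
let str = "2 turkeys cost 60€";

// "6" is followed by "0", not by €, so it passes the lookahead
let naive = str.match(/\d+(?!€)/g);

// neither a digit nor € may follow the number
let strict = str.match(/\d+(?![\d€])/g);

console.log(naive);  // ["2", "6"]
console.log(strict); // ["2"]
```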
## Lookbehind
Lookahead allows us to add a condition for "what goes after".
Lookbehind is similar, but it looks behind. That is, it allows us to match a pattern only if there's something before it.
The syntax is:
- Positive lookbehind: `pattern:(?<=y)x`, matches `pattern:x`, but only if it follows after `pattern:y`.
- Negative lookbehind: `pattern:(?<!y)x`, matches `pattern:x`, but only if there's no `pattern:y` before.
For example, let's change the price to US dollars. The dollar sign is usually before the number, so to look for `$30` we'll use `pattern:(?<=\$)\d+` -- an amount preceded by `subject:$`:
```js run
let str = "1 turkey costs $30";
alert( str.match(/(?<=\$)\d+/) ); // 30 (skipped the sole number)
```
And, to find the quantity -- a number, not preceded by `subject:$`, we can use a negative lookbehind `pattern:(?<!\$)\d+`:
```js run
let str = "2 turkeys cost $60";
alert( str.match(/(?<!\$)\d+/) ); // 2 (skipped the price)
```
## Capture groups
Generally, what's inside the lookaround (a common name for both lookahead and lookbehind) parentheses does not become a part of the match.
E.g. in the pattern `pattern:\d+(?!€)`, the `pattern:€` sign doesn't get captured as a part of the match.
But if we want to capture the whole lookaround expression or a part of it, that's possible. We just need to wrap it in additional parentheses.
For instance, here the currency `pattern:(€|kr)` is captured, along with the amount:
```js run
let str = "1 turkey costs 30€";
let reg = /\d+(?=(€|kr))/; // extra parentheses around €|kr
alert( str.match(reg) ); // 30, €
```
And here's the same for lookbehind:
```js run
let str = "1 turkey costs $30";
let reg = /(?<=(\$|£))\d+/;
alert( str.match(reg) ); // 30, $
```
Please note that for lookbehind the order of items in the result stays the same, even though the lookbehind parentheses come before the main pattern.
The full match is always the first item of the result, and captured groups follow it. So the match for `pattern:\d+` goes in the result first, and then the group `pattern:(\$|£)`.
## Summary
Lookahead and lookbehind (commonly referred to as "lookaround") are useful when we'd like to match something depending on the context before/after it, without taking that context into the match.
Sometimes we can do the same manually, that is: match all and filter by context in the loop. Remember, `str.matchAll` and `reg.exec` return matches with `.index` property, so we know where exactly in the text it is. But generally regular expressions can do it better.
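The manual approach could look like the sketch below: find all numbers, then use `.index` to look at the character right after each match. The variable names are ours, and the check is deliberately simple:

```js
let str = "2 turkeys cost 60€";

let prices = [];
for (let match of str.matchAll(/\d+/g)) {
  // position right after the matched number
  let end = match.index + match[0].length;

  // keep the number only if € follows it
  if (str[end] === "€") prices.push(match[0]);
}

console.log(prices); // ["60"]
```

The lookahead `pattern:\d+(?=€)` expresses the same condition inside the pattern itself.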
Lookaround types:
| Pattern | type | matches |
|--------------------|------------------|---------|
| `pattern:x(?=y)` | Positive lookahead | `x` if followed by `y` |
| `pattern:x(?!y)` | Negative lookahead | `x` if not followed by `y` |
| `pattern:(?<=y)x` | Positive lookbehind | `x` if after `y` |
| `pattern:(?<!y)x` | Negative lookbehind | `x` if not after `y` |
Lookahead can also be used to disable backtracking. Why that may be needed -- see in the next chapter.

View file

@ -0,0 +1,293 @@
# Infinite backtracking problem
Some regular expressions look simple, but can take a veeeeeery long time to execute, and even "hang" the JavaScript engine.
Sooner or later most developers face such behavior.
The typical situation -- a regular expression works fine sometimes, but for certain strings it "hangs" consuming 100% of CPU.
In a web-browser it kills the page. Not a good thing for sure.
For server-side JavaScript it may become a vulnerability if it uses regular expressions to process user data. Bad input will make the process hang, causing denial of service. The author personally saw and reported such vulnerabilities even for very well-known and widely used programs.
So the problem is definitely worth dealing with.
## Introduction
The plan will be like this:
1. First we see how the problem may occur.
2. Then we simplify the situation and see why it occurs.
3. Then we fix it.
For instance let's consider searching tags in HTML.
We want to find all tags, with or without attributes -- like `subject:<a href="..." class="doc" ...>`. We need the regexp to work reliably, because HTML comes from the internet and can be messy.
In particular, we need it to match tags like `<a test="<>" href="#">` -- with `<` and `>` in attributes. That's allowed by [HTML standard](https://html.spec.whatwg.org/multipage/syntax.html#syntax-attributes).
Now we can see that a simple regexp like `pattern:<[^>]+>` doesn't work, because it stops at the first `>`, and we need to ignore `<>` if inside an attribute.
```js run
// the match doesn't reach the end of the tag - wrong!
alert( '<a test="<>" href="#">'.match(/<[^>]+>/) ); // <a test="<>
```
To correctly handle such situations we need a more complex regular expression. It will have the form `pattern:<tag (key=value)*>`.
1. For the `tag` name: `pattern:\w+`,
2. For the `key` name: `pattern:\w+`,
3. And the `value`: a quoted string `pattern:"[^"]*"`.
If we substitute these into the pattern above and throw in some optional spaces `pattern:\s`, the full regexp becomes: `pattern:<\w+(\s*\w+="[^"]*"\s*)*>`.
That regexp is not perfect! It doesn't yet support all details of HTML, for instance unquoted values, and there are other ways to improve, but let's not add complexity. It will demonstrate the problem for us.
The regexp seems to work:
```js run
let reg = /<\w+(\s*\w+="[^"]*"\s*)*>/g;
let str='...<a test="<>" href="#">... <b>...';
alert( str.match(reg) ); // <a test="<>" href="#">, <b>
```
Great! It found both the long tag `match:<a test="<>" href="#">` and the short one `match:<b>`.
Now that we've got a seemingly working solution, let's get to the infinite backtracking itself.
## Infinite backtracking
If you run our regexp on the input below, it may hang the browser (or another JavaScript host):
```js run
let reg = /<\w+(\s*\w+="[^"]*"\s*)*>/g;
let str = `<tag a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b"
a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b" a="b"`;
*!*
// The search will take a long, long time
alert( str.match(reg) );
*/!*
```
Some regexp engines can handle that search, but most of them can't.
What's the matter? Why does a simple regular expression "hang" on such a small string?
Let's simplify the regexp by stripping the tag name and the quotes. So that we look only for `key=value` attributes: `pattern:<(\s*\w+=\w+\s*)*>`.
Unfortunately, the regexp still hangs:
```js run
// only search for space-delimited attributes
let reg = /<(\s*\w+=\w+\s*)*>/g;
let str = `<a=b a=b a=b a=b a=b a=b a=b a=b
a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b`;
*!*
// the search will take a long, long time
alert( str.match(reg) );
*/!*
```
Here we end the demo of the problem and start looking into what's going on, why it hangs and how to fix it.
## Detailed example
To make an example even simpler, let's consider `pattern:(\d+)*$`.
This regular expression also has the same problem. In most regexp engines that search takes a very long time (careful -- can hang):
```js run
alert( '12345678901234567890123456789123456789z'.match(/(\d+)*$/) );
```
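To get a feeling for the slowdown without hanging anything, we can time the same search on much shorter strings. This is only a rough sketch: `timeSearch` is a hypothetical helper, the lengths are picked to stay small, and the timings depend on the engine and the machine. In a backtracking engine the time tends to grow quickly as digits are added:

```js
// Hypothetical helper: time one search on `digits` digits followed by "z"
function timeSearch(digits) {
  let str = "1".repeat(digits) + "z";

  let start = Date.now();
  let result = str.match(/(\d+)*$/);

  return { ms: Date.now() - start, result };
}

// short inputs only, the long one above may hang
for (let digits of [12, 14, 16, 18, 20]) {
  console.log(digits, "digits:", timeSearch(digits).ms, "ms");
}
```

Note that the search does finish with a result: after all attempts from the digits fail, the pattern matches the empty string at the very end of the text.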
So what's wrong with the regexp?
First, one may notice that the regexp is a little bit strange. The quantifier `pattern:*` looks extraneous. If we want a number, we can use `pattern:\d+$`.
Indeed, the regexp is artificial. But the reason why it is slow is the same as those we saw above. So let's understand it, and then the previous example will become obvious.
What happens during the search of `pattern:(\d+)*$` in the line `subject:123456789z`?
1. First, the regexp engine tries to find a number `pattern:\d+`. The plus `pattern:+` is greedy by default, so it consumes all digits:
```
\d+.......
(123456789)z
```
2. Then it tries to apply the star quantifier, but there are no more digits, so the star doesn't give anything.
3. Then the pattern expects to see the string end `pattern:$`, and in the text we have `subject:z`, so there's no match:
```
X
\d+........$
(123456789)z
```
4. As there's no match, the greedy quantifier `pattern:+` decreases the count of repetitions (backtracks).
Now `\d+` doesn't take all digits, but all except the last one:
```
\d+.......
(12345678)9z
```
5. Now the engine tries to continue the search from the new position (`9`).
The star `pattern:(\d+)*` can be applied -- it gives the number `match:9`:
```
\d+.......\d+
(12345678)(9)z
```
The engine tries to match `$` again, but fails, because it meets `subject:z`:
```
X
\d+.......\d+
(12345678)(9)z
```
6. There's no match, so the engine will continue backtracking, decreasing the number of repetitions for `pattern:\d+` down to 7 digits. So the rest of the string `subject:89` becomes the second `pattern:\d+`:
```
X
\d+......\d+
(1234567)(89)z
```
...Still no match for `pattern:$`.
The search engine backtracks again. Backtracking generally works like this: the last greedy quantifier decreases the number of repetitions for as long as it can. Then the previous greedy quantifier decreases, and so on. In our case the last greedy quantifier is the second `pattern:\d+`: it shrinks from `subject:89` to `subject:8`, and then the star takes `subject:9`:
```
X
\d+......\d+\d+
(1234567)(8)(9)z
```
7. ...Fail again. The second and third `pattern:\d+` have nothing left to give back, so the first quantifier shortens the match to `subject:123456`, and the star takes the rest:
```
X
\d+.....\d+
(123456)(789)z
```
Again no match. The process repeats: the last greedy quantifier releases one character (`9`):
```
X
\d+.....\d+ \d+
(123456)(78)(9)z
```
8. ...And so on.
The regular expression engine goes through all the ways to split `123456789` into consecutive numbers. There are a lot of them, and their count grows very fast with the length of the string. That's why it takes so long.
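To get a feel for the numbers, here is a small sketch that lists the ways to cut a string of digits into consecutive groups. It is not how the engine is implemented, and the function name `splits` is invented for this illustration; it only enumerates the same combinations, longest first group first, in the order the greedy search would meet them.

```js
// Hypothetical helper: all ways to cut `digits` into consecutive non-empty groups
function splits(digits) {
  if (digits.length === 0) return [[]];
  let result = [];
  // like the greedy quantifier, try the longest first group first
  for (let k = digits.length; k >= 1; k--) {
    let head = digits.slice(0, k);
    for (let rest of splits(digits.slice(k))) {
      result.push([head, ...rest]);
    }
  }
  return result;
}

let all = splits("1234");
console.log(all.length); // 8
console.log(all.map(groups => groups.join("|")).join("  "));
// 1234  123|4  12|34  12|3|4  1|234  1|23|4  1|2|34  1|2|3|4

console.log(splits("123456789").length); // 256
```

Every extra digit doubles the count (for `n` digits there are `2 ** (n - 1)` splits), so a string of 38 digits gives far more combinations than any engine can check in reasonable time.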
What to do?
Should we turn on the lazy mode?
Unfortunately, that doesn't help: if we replace `pattern:\d+` with `pattern:\d+?`, the regexp still hangs:
```js run
// sloooooowwwwww
alert( '12345678901234567890123456789123456789z'.match(/(\d+?)*$/) );
```
Lazy quantifiers actually do the same, but in the reverse order.
Just think about how the search engine would work in this case.
Some regular expression engines have tricky built-in checks to detect infinite backtracking, or other means to work around it, but there's no universal solution.
## Back to tags
In the example above, when we search `pattern:<(\s*\w+=\w+\s*)*>` in the string `subject:<a=b a=b a=b a=b`, a similar thing happens.
The string has no `>` at the end, so the match is impossible, but the regexp engine doesn't know about it. The search backtracks trying different combinations of `pattern:(\s*\w+=\w+\s*)`:
```
(a=b a=b a=b) (a=b)
(a=b a=b) (a=b a=b)
(a=b) (a=b a=b a=b)
...
```
## How to fix?
The backtracking checks many variants that are an obvious fail for a human.
For instance, in the pattern `pattern:(\d+)*$` a human can easily see that `pattern:(\d+)*` does not need to backtrack `pattern:+`. There's no difference between one or two `\d+`:
```
\d+........
(123456789)z
\d+...\d+....
(1234)(56789)z
```
Let's get back to a more real-life example: `pattern:<(\s*\w+=\w+\s*)*>`. We want it to find pairs `name=value` (as many as it can).
What we would like to do is to forbid backtracking.
There's totally no need to decrease the number of repetitions.
In other words, if it found three `name=value` pairs and then can't find `>` after them, then there's no need to decrease the count of repetitions. There's definitely no `>` after the first two either (we backtracked one `name=value` pair, and it's still there):
```
(name=value) name=value
```
Many modern regexp engines support so-called "possessive" quantifiers for that. They are like greedy ones, but don't backtrack at all. Pretty simple: they capture whatever they can, and the search continues. There's also another tool called "atomic groups" that forbids backtracking inside parentheses.
Unfortunately, neither of these features is supported by JavaScript.
### Lookahead to the rescue
We can forbid backtracking using lookahead.
The pattern to take as many repetitions as possible without backtracking is: `pattern:(?=(a+))\1`.
In other words:
- The lookahead `pattern:?=` looks for the maximal count `pattern:a+` from the current position.
- And then they are "consumed into the result" by the backreference `pattern:\1` (`pattern:\1` corresponds to the content of the inner parentheses, the first capturing group, that is `pattern:a+`).
There will be no backtracking, because lookahead does not backtrack. If, for example, `pattern:a+` found 5 characters and the further match failed, it doesn't go back to 4.
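A small comparison may help to see the difference. This is an illustrative sketch with made-up inputs; the last part applies the same trick to the artificial pattern `pattern:(\d+)*$` from the detailed example, which is one possible way to rewrite it, not the only one.

```js
// Greedy a+ gives back one "a" so that the following "ab" can match
let greedy = "aaab".match(/a+ab/);

// (?=(a+))\1 takes all three "a" and never gives any of them back,
// so the following "ab" can't match
let atomic = "aaab".match(/(?=(a+))\1ab/);

console.log(greedy[0]); // aaab
console.log(atomic);    // null

// The same trick applied to (\d+)*$ : each \d+ is now taken "all or nothing"
let digitsStr = "12345678901234567890123456789123456789z";
let fast = digitsStr.match(/(?:(?=(\d+))\1)*$/);

// finishes quickly; the only match is the empty one at the very end
console.log(fast.index === digitsStr.length); // true
```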
```smart
There's more about the relation between possessive quantifiers and lookahead in articles [Regex: Emulate Atomic Grouping (and Possessive Quantifiers) with LookAhead](http://instanceof.me/post/52245507631/regex-emulate-atomic-grouping-with-lookahead) and [Mimicking Atomic Groups](http://blog.stevenlevithan.com/archives/mimic-atomic-groups).
```
So this trick makes the problem disappear.
Let's fix the regexp for a tag with attributes from the beginning of the chapter: `pattern:<\w+(\s*\w+=(\w+|"[^"]*")\s*)*>`. We'll use lookahead to prevent backtracking of `name=value` pairs:
```js run
// regexp to search name=value
let attrReg = /(\s*\w+=(\w+|"[^"]*")\s*)/;
// use new RegExp to nicely insert its source into (?=(a+))\1
let fixedReg = new RegExp(`<\\w+(?=(${attrReg.source}*))\\1>`, 'g');
let goodInput = '...<a test="<>" href="#">... <b>...';
let badInput = `<tag a=b a=b a=b a=b a=b a=b a=b a=b
a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b a=b`;
alert( goodInput.match(fixedReg) ); // <a test="<>" href="#">, <b>
alert( badInput.match(fixedReg) ); // null (no results, fast!)
```
Great, it works! We found both a long tag `match:<a test="<>" href="#">` and a small one `match:<b>`, and (!) didn't hang the engine on the bad input.
Please note the `attrReg.source` property. `RegExp` objects provide access to their source string in it. That's convenient when we want to insert one regexp into another.
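Here is a smaller illustration of the same idea. The patterns and the sample text are invented for the example; note that `source` holds only the pattern, without the flags.

```js
let num = /\d+/gi;
console.log(num.source); // \d+   (the flags are not included)

// Build a pattern for a range like "10-25" out of the smaller regexp
let range = new RegExp(`${num.source}-${num.source}`, "g");
console.log(range.source); // \d+-\d+

let found = "pages 10-25 and 40-42".match(range);
console.log(found.join()); // 10-25,40-42
```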

View file

@ -0,0 +1,89 @@
# Unicode: flag "u"
The unicode flag `/.../u` enables the correct support of surrogate pairs.
Surrogate pairs are explained in the chapter <info:string>.
Let's briefly recall them here. In short, normally characters are encoded with 2 bytes. That gives us 65536 characters maximum. But there are more characters in the world.
So certain rare characters are encoded with 4 bytes, like `𝒳` (mathematical X) or `😄` (a smile).
Here are the unicode values to compare:
| Character | Unicode | Bytes |
|------------|---------|--------|
| `a` | 0x0061 | 2 |
| `≈` | 0x2248 | 2 |
|`𝒳`| 0x1d4b3 | 4 |
|`𝒴`| 0x1d4b4 | 4 |
|`😄`| 0x1f604 | 4 |
So characters like `a` and `≈` occupy 2 bytes, and those rare ones take 4.
Unicode is designed in such a way that the 4-byte characters only have a meaning as a whole.
In the past JavaScript did not know about that, and many string methods still have problems. For instance, `length` thinks that there are two characters here:
```js run
alert('😄'.length); // 2
alert('𝒳'.length); // 2
```
...But we can see that there's only one, right? The point is that `length` treats 4 bytes as two 2-byte characters. That's incorrect, because they must be considered only together (so-called "surrogate pair").
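The difference between 2-byte units and whole characters can be checked directly. A short sketch using standard string methods (the variable names are arbitrary):

```js
let smile = "😄";

console.log(smile.length);      // 2, counts 2-byte units
console.log([...smile].length); // 1, spread iterates over whole characters

// codePointAt reads the whole surrogate pair, charCodeAt only one half
console.log(smile.codePointAt(0).toString(16)); // 1f604
console.log(smile.charCodeAt(0).toString(16));  // d83d
console.log(smile.charCodeAt(1).toString(16));  // de04
```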
Normally, regular expressions also treat "long characters" as two 2-byte ones.
That leads to odd results, for instance let's try to find `pattern:[𝒳𝒴]` in the string `subject:𝒳`:
```js run
alert( '𝒳'.match(/[𝒳𝒴]/) ); // odd result (wrong match actually, "half-character")
```
The result is wrong, because by default the regexp engine does not understand surrogate pairs.
So, it thinks that `[𝒳𝒴]` are not two, but four characters:
1. the left half of `𝒳` `(1)`,
2. the right half of `𝒳` `(2)`,
3. the left half of `𝒴` `(3)`,
4. the right half of `𝒴` `(4)`.
We can list them like this:
```js run
for (let i = 0; i < '𝒳𝒴'.length; i++) {
  alert('𝒳𝒴'.charCodeAt(i)); // 55349, 56499, 55349, 56500
}
```
So it finds only the "left half" of `𝒳`.
In other words, the search works like `'12'.match(/[1234]/)`: only `1` is returned.
## The "u" flag
The `/.../u` flag fixes that.
It enables surrogate pairs in the regexp engine, so the result is correct:
```js run
alert( '𝒳'.match(/[𝒳𝒴]/u) ); // 𝒳
```
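The dot `pattern:.` shows the same difference. A brief sketch (the sample string is arbitrary):

```js
let face = "😄";

// Without "u" the string looks like two "characters", so ^.$ can't match
console.log(/^.$/.test(face));  // false
console.log(/^.$/u.test(face)); // true

// Without "u" the dot takes only half of the surrogate pair
console.log(face.match(/./)[0].length);  // 1
console.log(face.match(/./u)[0].length); // 2
```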
Let's see one more example.
If we forget the `u` flag and accidentally use surrogate pairs, then we can get an error:
```js run
'𝒳'.match(/[𝒳-𝒴]/); // SyntaxError: invalid range in character class
```
Normally, regexps understand `[a-z]` as a "range of characters with codes between the codes of `a` and `z`".
But without the `u` flag, surrogate pairs are assumed to be a "pair of independent characters", so `[𝒳-𝒴]` is like `[<55349><56499>-<55349><56500>]` (each surrogate pair replaced with its codes). Now we can clearly see that the range `56499-55349` is unacceptable, as the left range border must be less than the right one.
Using the `u` flag makes it work right:
```js run
alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴
```

View file

@ -0,0 +1,86 @@
# Unicode character properties \p
[Unicode](https://en.wikipedia.org/wiki/Unicode), the encoding format used by JavaScript strings, has a lot of properties for different characters (or, technically, code points). They describe which "categories" a character belongs to, and a variety of technical details.
In regular expressions these can be matched with `\p{…}`. The flag `'u'` is required.
For instance, `\p{Letter}` denotes a letter in any language. We can also use `\p{L}`, as `L` is an alias of `Letter`. There are shorter aliases for almost every property.
Here's the main tree of properties:
- Letter `L`:
- lowercase `Ll`, modifier `Lm`, titlecase `Lt`, uppercase `Lu`, other `Lo`
- Number `N`:
- decimal digit `Nd`, letter number `Nl`, other `No`
- Punctuation `P`:
- connector `Pc`, dash `Pd`, initial quote `Pi`, final quote `Pf`, open `Ps`, close `Pe`, other `Po`
- Mark `M` (accents etc):
- spacing combining `Mc`, enclosing `Me`, non-spacing `Mn`
- Symbol `S`:
- currency `Sc`, modifier `Sk`, math `Sm`, other `So`
- Separator `Z`:
- line `Zl`, paragraph `Zp`, space `Zs`
- Other `C`:
- control `Cc`, format `Cf`, not assigned `Cn`, private use `Co`, surrogate `Cs`.
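A few of these categories in action. This is just a sketch with an arbitrary sample string; it assumes an engine that supports `\p{…}` (modern browsers and Node.js do).

```js
let sample = "Aa Бб Ⅻ 42 €";

console.log(sample.match(/\p{Lu}/gu).join()); // A,Б   (uppercase letters)
console.log(sample.match(/\p{Ll}/gu).join()); // a,б   (lowercase letters)
console.log(sample.match(/\p{Nd}/gu).join()); // 4,2   (decimal digits)
console.log(sample.match(/\p{Nl}/gu).join()); // Ⅻ     (letter number)
console.log(sample.match(/\p{Sc}/gu).join()); // €     (currency symbol)
```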
```smart header="More information"
Interested to see which characters belong to a property? There's a tool at <http://cldr.unicode.org/unicode-utilities/list-unicodeset> for that.
You could also explore properties at [Character Property Index](http://unicode.org/cldr/utility/properties.jsp).
For the full Unicode Character Database in text format (along with all properties), see <https://www.unicode.org/Public/UCD/latest/ucd/>.
```
There are also other derived categories, like:
- `Alphabetic` (`Alpha`), includes Letters `L`, plus letter numbers `Nl` (e.g. Roman numerals such as Ⅻ), plus some other symbols `Other_Alphabetic` (`OAlpha`).
- `Hex_Digit` includes hexadecimal digits: `0-9`, `a-f`, `A-F`.
- ...Unicode is a big beast, it includes a lot of properties.
For instance, let's look for a 6-digit hex number:
```js run
let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is required
alert("color: #123ABC".match(reg)); // 123ABC
```
There are also properties with a value. For instance, Unicode "Script" (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese) etc, the [list is long](https://en.wikipedia.org/wiki/Script_(Unicode)).
To search for characters of a certain script, we should supply `Script=<value>` (or the alias `sc=<value>`), e.g. to search for Cyrillic letters: `\p{sc=Cyrillic}`, for Chinese glyphs: `\p{sc=Han}`, etc:
```js run
let regexp = /\p{sc=Han}+/gu; // get Chinese words
let str = `Hello Привет 你好 123_456`;
alert( str.match(regexp) ); // 你好
```
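The same works for other scripts, with either the long name `Script=` or the alias `sc=`. A short sketch with made-up sample strings:

```js
let text = `Hello Привет 你好 123_456`;

console.log(text.match(/\p{sc=Cyrillic}+/gu).join()); // Привет
console.log(text.match(/\p{sc=Latin}+/gu).join());    // Hello

// the long form of the property name works as well
console.log("αβγ abc".match(/\p{Script=Greek}+/gu).join()); // αβγ
```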
## Building multi-language \w
Let's make a "universal" regexp for `pattern:\w`, for any language. That task has a standard solution in many programming languages with unicode-aware regexps, e.g. Perl.
```js
/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u
```
Let's decipher. Remember, `pattern:\w` is actually the same as `pattern:[a-zA-Z0-9_]`.
So the character set includes:
- `Alphabetic` for letters,
- `Mark` for accents, as in Unicode accents may be represented by separate code points,
- `Decimal_Number` for numbers,
- `Connector_Punctuation` for the `'_'` character and alike,
- `Join_Control` - two special code points with hex codes `200c` and `200d`, used in ligatures e.g. in Arabic.
Or, if we replace long names with aliases (a list of aliases [here](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)):
```js run
let regexp = /([\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+)/gu;
let str = `Hello Привет 你好 123_456`;
alert( str.match(regexp) ); // Hello,Привет,你好,123_456
```
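For comparison, here is what the plain `pattern:\w` gives on the same string. A sketch, with variable names chosen only for this example:

```js
let phrase = `Hello Привет 你好 123_456`;

// plain \w knows only Latin letters, digits and the underscore
let asciiWords = phrase.match(/\w+/g);
console.log(asciiWords.join()); // Hello,123_456

// the Unicode-aware class finds words in any language
let anyWords = phrase.match(/[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+/gu);
console.log(anyWords.join()); // Hello,Привет,你好,123_456
```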

View file

@ -0,0 +1,71 @@
# Sticky flag "y", searching at position
To grasp the use case of the `y` flag, and see how great it is, let's explore a practical example.
One of common tasks for regexps is "parsing": when we get a text and analyze it for logical components, build a structure.
For instance, there are HTML parsers for browser pages, that turn text into a structured document. There are parsers for programming languages, like JavaScript, etc.
Writing parsers is a special area, with its own tools and algorithms, so we don't go deep in there, but there's a very common question: "What is the text at the given position?".
For instance, for a programming language the variants can be like:
- Is it a "name" `pattern:\w+`?
- Or is it a number `pattern:\d+`?
- Or an operator `pattern:[-+/*]`?
- (a syntax error if it's not anything in the expected list)
In JavaScript, to perform a search starting from a given position, we can use `regexp.exec` with the `regexp.lastIndex` property, but that's not what we need!
We'd like to check the match exactly at the given position, not "starting" from it.
Here's a (failing) attempt to use `lastIndex`:
```js run
let str = "(text before) function ...";
// attempting to find function at position 5:
let regexp = /function/g; // must use "g" flag, otherwise lastIndex is ignored
regexp.lastIndex = 5;
alert (regexp.exec(str)); // function
```
The match is found, because `regexp.exec` starts to search from the given position and goes on by the text, successfully matching "function" later.
We could work around that by checking whether the `regexp.exec(str).index` property is `5`, and if not, ignoring the match. But the main problem here is performance.
The regexp engine does a lot of unnecessary work by scanning at further positions. The delays are clearly noticeable if the text is long, because there are many such searches in a parser.
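For illustration, the workaround could look like the sketch below. The helper name `matchAtSlow` is hypothetical; it gives the right answer, but the engine still scans the rest of the text before we throw the result away.

```js
// Hypothetical helper: match exactly at `pos` using the "g" flag and an index check
function matchAtSlow(regexp, str, pos) {
  regexp.lastIndex = pos; // works only for regexps with the "g" flag
  let match = regexp.exec(str); // may scan far beyond pos
  return match !== null && match.index === pos ? match : null;
}

let source = "(text before) function ...";

let atFive = matchAtSlow(/function/g, source, 5);
let atFourteen = matchAtSlow(/function/g, source, 14);

console.log(atFive);        // null
console.log(atFourteen[0]); // function
```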
## The "y" flag
So we've come to the problem: how to search for a match, starting exactly at the given position.
That's what the `y` flag does. It makes the regexp search only at the `lastIndex` position.
Here's an example:
```js run
let str = "(text before) function ...";
*!*
let regexp = /function/y;
regexp.lastIndex = 5;
*/!*
alert (regexp.exec(str)); // null (no match, unlike "g" flag!)
*!*
regexp.lastIndex = 14;
*/!*
alert (regexp.exec(str)); // function (match!)
```
As we can see, now the regexp is only matched at the given position.
So what `y` does is truly unique, and very important for writing parsers.
The `y` flag allows us to apply a regular expression (or many of them one-by-one) exactly at the given position, and when we understand what's there, we can move on -- step by step examining the text.
Without the flag the regexp engine always searches till the end of the text, which takes time, especially if the text is large. So our parser would be very slow. The `y` flag is exactly the right thing here.
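To make this concrete, here is a minimal tokenizer sketch built on the `y` flag. It is only an illustration under simple assumptions: the token types, their patterns and the function name `tokenize` are made up for the example, and a real parser would need many more rules.

```js
// Hypothetical tokenizer: tries each rule exactly at the current position
function tokenize(str) {
  // every pattern is sticky ("y") and never matches an empty string
  const rules = [
    ["space", /\s+/y],
    ["number", /\d+/y],
    ["name", /[A-Za-z_]\w*/y],
    ["operator", /[-+*\/=]/y],
  ];

  const tokens = [];
  let pos = 0;

  while (pos < str.length) {
    let found = false;
    for (const [type, regexp] of rules) {
      regexp.lastIndex = pos; // look only at this position
      const match = regexp.exec(str);
      if (match) {
        if (type !== "space") tokens.push({ type, value: match[0] });
        pos = regexp.lastIndex; // move on right after the token
        found = true;
        break;
      }
    }
    // nothing from the expected list: a syntax error
    if (!found) throw new SyntaxError(`Unexpected character at position ${pos}`);
  }
  return tokens;
}

let tokens = tokenize("x = 12 + y1");
console.log(tokens.map(t => `${t.type}:${t.value}`).join(" "));
// name:x operator:= number:12 operator:+ name:y1
```

Each rule is tested only at `pos`, so a failed attempt costs almost nothing, no matter how long the rest of the text is.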

View file

@ -0,0 +1,3 @@
# Regular expressions
Regular expressions are a powerful way of doing search and replace in strings.