regexp draft

This commit is contained in:
Ilya Kantor 2019-03-02 01:02:01 +03:00
parent 1369332661
commit 65184edf76
11 changed files with 730 additions and 399 deletions


@ -96,34 +96,32 @@ There are only 5 of them in JavaScript:
`m`
: Multiline mode (covered in the chapter <info:regexp-multiline>).
`s`
: "Dotall" mode, allows `.` to match newlines (covered in the chapter <info:regexp-character-classes>).
`u`
: Enables full unicode support. The flag enables correct processing of surrogate pairs. More about that in the chapter <info:regexp-unicode>.
`y`
: Sticky mode (covered in the [next chapter](info:regexp-methods#y-flag))
We'll cover all these flags further in the tutorial.
## The "i" flag
The simplest flag is `i`.
An example with it:
For now, the simplest flag is `i`, here's an example:
```js run
let str = "I love JavaScript!";
alert( str.search(/LOVE/) ); // -1 (not found)
alert( str.search(/LOVE/i) ); // 2
```
alert( str.search(/LOVE/i) ); // 2 (found lowercased)
1. The first search returns `-1` (not found), because the search is case-sensitive by default.
2. With the flag `pattern:/LOVE/i` the search found `match:love` at position 2.
alert( str.search(/LOVE/) ); // -1 (nothing found without 'i' flag)
```
So the `i` flag already makes regular expressions more powerful than a simple substring search. But there's so much more. We'll cover other flags and features in the next chapters.
## Summary
- A regular expression consists of a pattern and optional flags: `g`, `i`, `m`, `u`, `y`.
- A regular expression consists of a pattern and optional flags: `g`, `i`, `m`, `u`, `s`, `y`.
- Without flags and special symbols that we'll study later, the search by a regexp is the same as a substring search.
- The method `str.search(regexp)` returns the index where the match is found or `-1` if there's no match.
- The method `str.search(regexp)` returns the index where the match is found or `-1` if there's no match. In the next chapter we'll see other methods.


@ -5,7 +5,34 @@ There are two sets of methods to deal with regular expressions.
1. First, regular expressions are objects of the built-in [RegExp](mdn:js/RegExp) class, it provides many methods.
2. Besides that, there are methods in regular strings that can work with regexps.
The structure is a bit messed up, so we'll first consider methods separately, and then -- practical recipes for common tasks.
## Recipes
Which method to use depends on what we'd like to do.
Methods become much easier to understand if we separate them by their use in real-life tasks:
**To search for all matches:**
Use regexp `g` flag and:
- Get a flat array of matches -- `str.match(reg)`
- Get an array of matches with details -- `str.matchAll(reg)`.
**To search for the first match only:**
- Get the full first match -- `str.match(reg)` (without `g` flag).
- Get the string position of the first match -- `str.search(reg)`.
- Check if there's a match -- `regexp.test(str)`.
- Find the match from the given position -- `regexp.exec(str)` (set `regexp.lastIndex` to position).
**To replace all matches:**
- Replace with another string or a function result -- `str.replace(reg, str|func)`
**To split the string by a separator:**
- `str.split(str|reg)`
The details about every method follow in this chapter... But if you're reading for the first time and want to know more about regexps themselves, feel free to go ahead!
You may want to skip methods for now, move on to the next chapter, and then return here if something about a method is unclear.
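To make the recipes above more tangible, here's a minimal sketch that runs each of them on one string. The sample string and the variable names are just illustrative choices, and `console.log` is used instead of `alert` so that it also runs outside the browser:

```js
let str = "A lot about JavaScript at https://javascript.info";

// To search for all matches (the regexp needs the g flag):
let flat = str.match(/javascript/ig);               // ["JavaScript", "javascript"]
let detailed = Array.from(str.matchAll(/javascript/ig));
let positions = detailed.map(m => m.index);         // [12, 34]

// To search for the first match only:
let first = str.match(/javascript/i);               // first[0] is "JavaScript", first.index is 12
let position = str.search(/javascript/i);           // 12
let found = /javascript/i.test(str);                // true

// To find the match from the given position:
let regexp = /javascript/ig;
regexp.lastIndex = 13;
let next = regexp.exec(str);                        // next.index is 34

// To replace all matches:
let replaced = str.replace(/javascript/ig, "JS");   // "A lot about JS at https://JS.info"

// To split the string by a separator:
let parts = "12-34-56".split(/-/);                  // ["12", "34", "56"]

console.log(flat, positions, position, found, next.index, replaced, parts);
```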
## str.search(reg)
@ -17,15 +44,15 @@ let str = "A drop of ink may make a million think";
alert( str.search( *!*/a/i*/!* ) ); // 0 (the first position)
```
**The important limitation: `search` always looks for the first match.**
**The important limitation: `search` only finds the first match.**
We can't find next positions using `search`, there's just no syntax for that. But there are other methods that can.
## str.match(reg), no "g" flag
The method `str.match` behavior varies depending on the `g` flag. First let's see the case without it.
The behavior of `str.match` varies depending on whether `reg` has `g` flag or not.
Then `str.match(reg)` looks for the first match only.
First, if there's no `g` flag, then `str.match(reg)` looks for the first match only.
The result is an array with that match and additional properties:
@ -44,9 +71,11 @@ alert( result.index ); // 0 (at the zero position)
alert( result.input ); // "Fame is the thirst of youth" (the string)
```
The array may have more than one element.
A match result may have more than one element.
**If a part of the pattern is delimited by parentheses `(...)`, then it becomes a separate element of the array.**
**If a part of the pattern is delimited by parentheses `(...)`, then it becomes a separate element in the array.**
If parentheses have a name, designated by `(?<name>...)` at their start, then `result.groups[name]` has the content. We'll see that later in the chapter [about groups](info:regexp-groups).
For instance:
@ -63,7 +92,8 @@ alert( result.input ); // JavaScript is a programming language
Due to the `i` flag the search is case-insensitive, so it finds `match:JavaScript`. The part of the match that corresponds to `pattern:SCRIPT` becomes a separate array item.
We'll be back to parentheses later in the chapter <info:regexp-groups>. They are great for search-and-replace.
So, this method is used to find one full match with all details.
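As a small illustration of named parentheses, here's a sketch that gives the group a name. The name `rest` is made up for this example; the matched content is then available both by number and in `result.groups`:

```js
let str = "JavaScript is a programming language";

// (?<rest>...) is the same group as before, but with a (hypothetical) name "rest"
let result = str.match(/JAVA(?<rest>SCRIPT)/i);

console.log( result[0] );          // JavaScript (the whole match)
console.log( result[1] );          // Script (the group, by number)
console.log( result.groups.rest ); // Script (the same group, by name)
console.log( result.index );       // 0
```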
## str.match(reg) with "g" flag
@ -76,10 +106,10 @@ let str = "HO-Ho-ho!";
let result = str.match( *!*/ho/ig*/!* );
alert( result ); // HO, Ho, ho (all matches, case-insensitive)
alert( result ); // HO, Ho, ho (array of 3 matches, case-insensitive)
```
With parentheses nothing changes, here we go:
Parentheses do not change anything, here we go:
```js run
let str = "HO-Ho-ho!";
@ -89,22 +119,84 @@ let result = str.match( *!*/h(o)/ig*/!* );
alert( result ); // HO, Ho, ho
```
So, with `g` flag the `result` is a simple array of matches. No additional properties.
**So, with `g` flag `str.match` returns a simple array of all matches, without details.**
If we want to get information about match positions and use parentheses then we should use [RegExp#exec](mdn:js/RegExp/exec) method that we'll cover below.
If we want to get information about match positions and contents of parentheses then we should use `matchAll` method that we'll cover below.
````warn header="If there are no matches, the call to `match` returns `null`"
Please note, that's important. If there were no matches, the result is not an empty array, but `null`.
````warn header="If there are no matches, `str.match` returns `null`"
Please note, that's important. If there are no matches, the result is not an empty array, but `null`.
Keep that in mind to avoid pitfalls like this:
```js run
let str = "Hey-hey-hey!";
alert( str.match(/ho/gi).length ); // error! there's no length of null
alert( str.match(/Z/g).length ); // Error: Cannot read property 'length' of null
```
Here `str.match(/Z/g)` is `null`, it has no `length` property.
````
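One possible way to stay safe, shown here only as a sketch and not as the single right approach, is to fall back to an empty array when the result is `null`:

```js
let str = "Hey-hey-hey!";

// if match returns null, use an empty array instead
let matches = str.match(/Z/g) || [];

console.log( matches.length ); // 0, no error
```

With such a fallback the result is always an array, so `length` and loops are safe to use.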
## str.matchAll(regexp)
The method `str.matchAll(regexp)` is used to find all matches with all details.
For instance:
```js run
let str = "Javascript or JavaScript? Should we uppercase 'S'?";
let result = str.matchAll( *!*/java(script)/ig*/!* );
let [match1, match2] = result;
alert( match1[0] ); // Javascript (the whole match)
alert( match1[1] ); // script (the part of the match that corresponds to the parentheses)
alert( match1.index ); // 0
alert( match1.input ); // = str (the whole original string)
alert( match2[0] ); // JavaScript (the whole match)
alert( match2[1] ); // Script (the part of the match that corresponds to the parentheses)
alert( match2.index ); // 14
alert( match2.input ); // = str (the whole original string)
```
````warn header="`matchAll` returns an iterable, not array"
For instance, if we try to get the first match by index, it won't work:
```js run
let str = "Javascript or JavaScript??";
let result = str.matchAll( /javascript/ig );
*!*
alert(result[0]); // undefined (?! there must be a match)
*/!*
```
The reason is that the iterator is not an array. We need to run `Array.from(result)` on it, or use `for..of` loop to get matches.
In practice, if we need all matches, then `for..of` works, so it's not a problem.
And, to get only a few matches, we can use destructuring:
```js run
let str = "Javascript or JavaScript??";
*!*
let [firstMatch] = str.matchAll( /javascript/ig );
*/!*
alert(firstMatch); // Javascript
```
````
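The two options mentioned above, `Array.from` and `for..of`, can be sketched like this (the variable names are only for illustration):

```js
let str = "Javascript or JavaScript??";

// Option 1: convert the iterable into a real array, then use indexes
let matches = Array.from(str.matchAll(/javascript/ig));
console.log( matches[0][0] );  // Javascript
console.log( matches.length ); // 2

// Option 2: loop over the matches directly
let found = [];
for (let match of str.matchAll(/javascript/ig)) {
  found.push(`${match[0]} at ${match.index}`);
}
console.log( found ); // Javascript at 0, JavaScript at 14
```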
```warn header="`matchAll` is supernew, may need a polyfill"
The method may not work in old browsers. A polyfill might be needed (this site uses core-js).
Or you could make a loop with `regexp.exec`, explained below.
```
## str.split(regexp|substr, limit)
Splits the string using the regexp (or a substring) as a delimiter.
@ -112,27 +204,31 @@ Splits the string using the regexp (or a substring) as a delimiter.
We already used `split` with strings, like this:
```js run
alert('12-34-56'.split('-')) // [12, 34, 56]
alert('12-34-56'.split('-')) // array of [12, 34, 56]
```
But we can also pass a regular expression:
But we can split by a regular expression, the same way:
```js run
alert('12-34-56'.split(/-/)) // [12, 34, 56]
alert('12-34-56'.split(/-/)) // array of [12, 34, 56]
```
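A regexp becomes really useful for `split` when the delimiter is not a fixed string. As a sketch, here we split by any space character using the `\s` class (covered in the chapter about character classes) and also try the optional `limit` argument:

```js
let str = "12 34\t56\n78";

// split by any single space character: a space, a tab or a newline
let parts = str.split(/\s/);
console.log( parts ); // 12, 34, 56, 78

// the second argument limits the number of items in the result
let firstTwo = str.split(/\s/, 2);
console.log( firstTwo ); // 12, 34
```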
## str.replace(str|reg, str|func)
The swiss army knife for search and replace in strings.
That's actually a great method, one of the most useful ones. The swiss army knife for searching and replacing.
The simplest use -- search and replace a substring, like this:
The simplest use -- searching and replacing a substring, like this:
```js run
// replace a dash by a colon
alert('12-34-56'.replace("-", ":")) // 12:34-56
```
When the first argument of `replace` is a string, it only looks for the first match.
There's a pitfall though.
**When the first argument of `replace` is a string, it only looks for the first match.**
You can see that in the example above: only the first `"-"` is replaced by `":"`.
To find all dashes, we need to use not the string `"-"`, but a regexp `pattern:/-/g`, with an obligatory `g` flag:
@ -141,9 +237,7 @@ To find all dashes, we need to use not the string `"-"`, but a regexp `pattern:/
alert( '12-34-56'.replace( *!*/-/g*/!*, ":" ) ) // 12:34:56
```
The second argument is a replacement string.
We can use special characters in it:
The second argument is a replacement string. We can use special characters in it:
| Symbol | Inserts |
|--------|--------|
@ -151,24 +245,33 @@ We can use special characters in it:
|`$&`|the whole match|
|<code>$&#096;</code>|a part of the string before the match|
|`$'`|a part of the string after the match|
|`$n`|if `n` is a 1-2 digit number, then it means the contents of n-th parentheses counting from left to right|
|`$n`|if `n` is a 1-2 digit number, then it means the contents of n-th parentheses counting from left to right; named parentheses are referenced as `$<name>`|
For instance let's use `$&` to replace all entries of `"John"` by `"Mr.John"`:
For instance if we use `$&` in the replacement string, that means "put the whole match here".
Let's use it to prepend all entries of `"John"` with `"Mr."`:
```js run
let str = "John Doe, John Smith and John Bull.";
let str = "John Doe, John Smith and John Bull";
// for each John - replace it with Mr. and then John
alert(str.replace(/John/g, 'Mr.$&'));
// "Mr.John Doe, Mr.John Smith and Mr.John Bull.";
alert(str.replace(/John/g, 'Mr.$&')); // Mr.John Doe, Mr.John Smith and Mr.John Bull
```
Parentheses are very often used together with `$1`, `$2`, like this:
Quite often we'd like to reuse parts of the source string, recombine them in the replacement or wrap into something.
To do so, we should:
1. First, mark the parts by parentheses in regexp.
2. Use `$1`, `$2` (and so on) in the replacement string to get the content matched by parentheses.
For instance:
```js run
let str = "John Smith";
alert(str.replace(/(John) (Smith)/, '$2, $1')) // Smith, John
// swap first and last name
alert(str.replace(/(john) (smith)/i, '$2, $1')) // Smith, John
```
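Here's a short sketch of the other special characters in the replacement string. The group names `name` and `surname` are made up for this example; named parentheses are referenced as `$<name>`:

```js
let str = "John Smith";

// the same swap as above, but with named parentheses
let swapped = str.replace(/(?<name>john) (?<surname>smith)/i, "$<surname>, $<name>");
console.log( swapped ); // Smith, John

// $` inserts the part before the match, $' inserts the part after it
let wrapped = "a-b".replace(/-/, "[$`|$']");
console.log( wrapped ); // a[a|b]b
```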
**For situations that require "smart" replacements, the second argument can be a function.**
@ -188,21 +291,22 @@ alert("HO-Ho-ho".replace(/ho/gi, function() {
In the example above the function just returns the next number every time, but usually the result is based on the match.
The function is called with arguments `func(str, p1, p2, ..., pn, offset, s)`:
The function is called with arguments `func(str, p1, p2, ..., pn, offset, input, groups)`:
1. `str` -- the match,
2. `p1, p2, ..., pn` -- contents of parentheses (if there are any),
3. `offset` -- position of the match,
4. `s` -- the source string.
4. `input` -- the source string,
5. `groups` -- an object with named groups (see chapter [](info:regexp-groups)).
If there are no parentheses in the regexp, then the function always has 3 arguments: `func(str, offset, s)`.
If there are no parentheses in the regexp, then there are only 3 arguments: `func(str, offset, input)`.
Let's use it to show full information about matches:
```js run
// show and replace all matches
function replacer(str, offset, s) {
alert(`Found ${str} at position ${offset} in string ${s}`);
function replacer(str, offset, input) {
alert(`Found ${str} at position ${offset} in string ${input}`);
return str.toLowerCase();
}
@ -215,10 +319,10 @@ alert( 'Result: ' + result ); // Result: ho-ho-ho
// Found ho at position 6 in string HO-Ho-ho
```
In the example below there are two parentheses, so `replacer` is called with 5 arguments: `str` is the full match, then parentheses, and then `offset` and `s`:
In the example below there are two parentheses, so `replacer` is called with 5 arguments: `str` is the full match, then parentheses, and then `offset` and `input`:
```js run
function replacer(str, name, surname, offset, s) {
function replacer(str, name, surname, offset, input) {
// name is the first parentheses, surname is the second one
return surname + ", " + name;
}
@ -230,13 +334,70 @@ alert(str.replace(/(John) (Smith)/, replacer)) // Smith, John
Using a function gives us the ultimate replacement power, because it gets all the information about the match, has access to outer variables and can do everything.
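When the regexp has named parentheses, the object with them comes as the last argument of the function. A minimal sketch, assuming group names `name` and `surname` chosen just for this example, could take it from the end of the arguments list:

```js
function replacer(...args) {
  // the last argument is the object with named groups,
  // before it come the source string and the offset
  let groups = args[args.length - 1];
  let offset = args[args.length - 3];

  return `${groups.surname}, ${groups.name} (at ${offset})`;
}

let str = "John Smith";
let result = str.replace(/(?<name>John) (?<surname>Smith)/, replacer);

console.log( result ); // Smith, John (at 0)
```

Taking arguments from the end keeps the function working even if more parentheses are added to the regexp later.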
## regexp.exec(str)
We've already seen these searching methods:
- `search` -- looks for the position of the match,
- `match` -- if there's no `g` flag, returns the first match with parentheses and all details,
- `match` -- if there's a `g` flag -- returns all matches, without details or parentheses,
- `matchAll` -- returns all matches with details.
The `regexp.exec` method is the most flexible searching method of all. Unlike previous methods, `exec` should be called on a regexp, rather than on a string.
It behaves differently depending on whether the regexp has the `g` flag.
If there's no `g`, then `regexp.exec(str)` returns the first match, exactly as `str.match(reg)`. Such behavior does not give us anything new.
But if there's `g`, then:
- `regexp.exec(str)` returns the first match and *remembers* the position after it in `regexp.lastIndex` property.
- The next call starts to search from `regexp.lastIndex` and returns the next match.
- If there are no more matches then `regexp.exec` returns `null` and `regexp.lastIndex` is set to `0`.
We could use it to get all matches with their positions and parentheses groups in a loop, instead of `matchAll`:
```js run
let str = 'A lot about JavaScript at https://javascript.info';
let regexp = /javascript/ig;
let result;
while (result = regexp.exec(str)) {
alert( `Found ${result[0]} at ${result.index}` );
// shows: Found JavaScript at 12, then:
// shows: Found javascript at 34
}
```
Surely, `matchAll` does the same, at least for modern browsers. But what `matchAll` can't do -- is to search from a given position.
Let's search from position `13`. What we need is to assign `regexp.lastIndex=13` and call `regexp.exec`:
```js run
let str = "A lot about JavaScript at https://javascript.info";
let regexp = /javascript/ig;
*!*
regexp.lastIndex = 13;
*/!*
let result;
while (result = regexp.exec(str)) {
alert( `Found ${result[0]} at ${result.index}` );
// shows: Found javascript at 34
}
```
Now, starting from the given position `13`, there's only one match.
## regexp.test(str)
Let's move on to the methods of `RegExp` class, that are callable on regexps themselves.
The method `regexp.test(str)` looks for a match and returns `true/false` whether it finds it.
The `test` method looks for any match and returns `true/false` whether they found it.
So it's basically the same as `str.search(reg) != -1`, for instance:
For instance:
```js run
let str = "I love JavaScript";
@ -255,153 +416,43 @@ alert( *!*/love/i*/!*.test(str) ); // false
alert( str.search(*!*/love/i*/!*) != -1 ); // false
```
## regexp.exec(str)
If the regexp has `'g'` flag, then `regexp.test` advances `regexp.lastIndex` property, just like `regexp.exec`.
We've already seen these searching methods:
- `search` -- looks for the position of the match,
- `match` -- if there's no `g` flag, returns the first match with parentheses,
- `match` -- if there's a `g` flag -- returns all matches, without separating parentheses.
The `regexp.exec` method is a bit harder to use, but it allows to search all matches with parentheses and positions.
It behaves differently depending on whether the regexp has the `g` flag.
- If there's no `g`, then `regexp.exec(str)` returns the first match, exactly as `str.match(reg)`.
- If there's `g`, then `regexp.exec(str)` returns the first match and *remembers* the position after it in `regexp.lastIndex` property. The next call starts to search from `regexp.lastIndex` and returns the next match. If there are no more matches then `regexp.exec` returns `null` and `regexp.lastIndex` is set to `0`.
As we can see, the method gives us nothing new if we use it without the `g` flag, because `str.match` does exactly the same.
But the `g` flag allows to get all matches with their positions and parentheses groups.
Here's the example how subsequent `regexp.exec` calls return matches one by one:
So we can use it to search from a given position:
```js run
let str = "A lot about JavaScript at https://javascript.info";
let regexp = /love/gi;
let regexp = /JAVA(SCRIPT)/ig;
let str = "I love JavaScript";
*!*
// Look for the first match
*/!*
let matchOne = regexp.exec(str);
alert( matchOne[0] ); // JavaScript
alert( matchOne[1] ); // script
alert( matchOne.index ); // 12 (the position of the match)
alert( matchOne.input ); // the same as str
alert( regexp.lastIndex ); // 22 (the position after the match)
*!*
// Look for the second match
*/!*
let matchTwo = regexp.exec(str); // continue searching from regexp.lastIndex
alert( matchTwo[0] ); // javascript
alert( matchTwo[1] ); // script
alert( matchTwo.index ); // 34 (the position of the match)
alert( matchTwo.input ); // the same as str
alert( regexp.lastIndex ); // 44 (the position after the match)
*!*
// Look for the third match
*/!*
let matchThree = regexp.exec(str); // continue searching from regexp.lastIndex
alert( matchThree ); // null (no match)
alert( regexp.lastIndex ); // 0 (reset)
// start the search from position 10:
regexp.lastIndex = 10
alert( regexp.test(str) ); // false (no match)
```
As we can see, each `regexp.exec` call returns the match in a "full format": as an array with parentheses, `index` and `input` properties.
The main use case for `regexp.exec` is to find all matches in a loop:
````warn header="Same global regexp tested repeatedly may fail to match"
If we apply the same global regexp to different inputs, it may lead to a wrong result, because each `regexp.test` call advances the `regexp.lastIndex` property, so the next search starts from a non-zero position.
For instance, here we call `regexp.test` twice on the same text, and the second time fails:
```js run
let str = 'A lot about JavaScript at https://javascript.info';
let regexp = /javascript/g; // (regexp just created: regexp.lastIndex=0)
let regexp = /javascript/ig;
let result;
while (result = regexp.exec(str)) {
alert( `Found ${result[0]} at ${result.index}` );
}
alert( regexp.test("javascript") ); // true (regexp.lastIndex=10 now)
alert( regexp.test("javascript") ); // false
```
The loop continues until `regexp.exec` returns `null` that means "no more matches".
That's exactly because `regexp.lastIndex` is non-zero on the second test.
````smart header="Search from the given position"
We can force `regexp.exec` to start searching from the given position by setting `lastIndex` manually:
```js run
let str = 'A lot about JavaScript at https://javascript.info';
let regexp = /javascript/ig;
regexp.lastIndex = 30;
alert( regexp.exec(str).index ); // 34, the search starts from the 30th position
```
To work around that, one could use non-global regexps or re-adjust `regexp.lastIndex=0` before a new search.
````
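Both workarounds can be sketched as follows (the variable names are only for illustration):

```js
let regexp = /javascript/g;

let first = regexp.test("javascript");  // true (regexp.lastIndex=10 now)

// Workaround 1: reset the position before a new search
regexp.lastIndex = 0;
let second = regexp.test("javascript"); // true

// Workaround 2: a regexp without the g flag always searches from the start
let plain = /javascript/;
let third = plain.test("javascript");   // true
let fourth = plain.test("javascript");  // true

console.log(first, second, third, fourth); // true true true true
```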
## The "y" flag [#y-flag]
## Summary
The `y` flag means that the search should find a match exactly at the position specified by the property `regexp.lastIndex` and only there.
There's a variety of methods on both regexps and strings.
In other words, normally the search is made in the whole string: `pattern:/javascript/` looks for "javascript" everywhere in the string.
Their abilities and methods overlap quite a bit, we can do the same by different calls. Sometimes that may cause confusion when starting to learn the language.
But when a regexp has the `y` flag, then it only looks for the match at the position specified in `regexp.lastIndex` (`0` by default).
For instance:
```js run
let str = "I love JavaScript!";
let reg = /javascript/iy;
alert( reg.lastIndex ); // 0 (default)
alert( str.match(reg) ); // null, not found at position 0
reg.lastIndex = 7;
alert( str.match(reg) ); // JavaScript (right, that word starts at position 7)
// for any other reg.lastIndex the result is null
```
The regexp `pattern:/javascript/iy` can only be found if we set `reg.lastIndex=7`, because due to `y` flag the engine only tries to find it in the single place within a string -- from the `reg.lastIndex` position.
So, what's the point? Where do we apply that?
The reason is performance.
The `y` flag works great for parsers -- programs that need to "read" the text and build in-memory syntax structure or perform actions from it. For that we move along the text and apply regular expressions to see what we have next: a string? A number? Something else?
The `y` flag allows to apply a regular expression (or many of them one-by-one) exactly at the given position and when we understand what's there, we can move on -- step by step examining the text.
Without the flag the regexp engine always searches till the end of the text, that takes time, especially if the text is large. So our parser would be very slow. The `y` flag is exactly the right thing here.
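To illustrate the parser idea, here's a toy tokenizer. It's only a sketch under our own assumptions: the function `tokenize`, the token types and the rules are all hypothetical, and quantifiers like `+` are covered later in the tutorial. At every position it tries sticky regexps one by one, and moves on after a match:

```js
// A toy tokenizer (hypothetical example): splits text into numbers and words
function tokenize(text) {
  // every rule has the y flag, so it only matches exactly at lastIndex
  let rules = [
    ["number", /\d+/y],
    ["word", /\w+/y],
    ["space", /\s+/y],
  ];

  let tokens = [];
  let pos = 0;

  while (pos < text.length) {
    let matched = false;

    for (let [type, regexp] of rules) {
      regexp.lastIndex = pos;          // look exactly at the current position
      let result = regexp.exec(text);
      if (!result) continue;           // not here, try the next rule

      if (type != "space") tokens.push({ type, value: result[0] });
      pos = regexp.lastIndex;          // move on, right after the match
      matched = true;
      break;
    }

    if (!matched) throw new SyntaxError(`Unexpected character at position ${pos}`);
  }

  return tokens;
}

let tokens = tokenize("let x 42");
console.log( tokens.map(t => `${t.type}:${t.value}`) ); // word:let, word:x, number:42
```

None of the regexps scans the rest of the text: a rule either matches right at `pos` or fails immediately.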
## Summary, recipes
Methods become much easier to understand if we separate them by their use in real-life tasks.
To search for the first match only:
: - Find the position of the first match -- `str.search(reg)`.
- Find the full match -- `str.match(reg)`.
- Check if there's a match -- `regexp.test(str)`.
- Find the match from the given position -- `regexp.exec(str)`, set `regexp.lastIndex` to position.
To search for all matches:
: - An array of matches -- `str.match(reg)`, the regexp with `g` flag.
- Get all matches with full information about each one -- `regexp.exec(str)` with `g` flag in the loop.
To search and replace:
: - Replace with another string or a function result -- `str.replace(reg, str|func)`
To split the string:
: - `str.split(str|reg)`
We also covered two flags:
- The `g` flag to find all matches (global search),
- The `y` flag to search at exactly the given position inside the text.
Now we know the methods and can use regular expressions. But we need to learn their syntax, so let's move on.
Then please refer to the recipes at the beginning of this chapter, as they provide solutions for the majority of regexp-related tasks.


@ -1,12 +1,14 @@
# Character classes
Consider a practical task -- we have a phone number `"+7(903)-123-45-67"`, and we need to find all digits in that string. Other characters do not interest us.
Consider a practical task -- we have a phone number `"+7(903)-123-45-67"`, and we need to turn it into pure numbers: `79031234567`.
A character class is a special notation that matches any symbol from the set.
To do so, we can find and remove anything that's not a number. Character classes can help with that.
For instance, there's a "digit" class. It's written as `\d`. We put it in the pattern, and during the search any digit matches it.
A character class is a special notation that matches any symbol from a certain set.
For instance, the regexp `pattern:/\d/` looks for a single digit:
To start, let's explore the "digit" class. It's written as `\d`. We put it in the pattern, and that means "any single digit".
For instance, let's find the first digit in the phone number:
```js run
let str = "+7(903)-123-45-67";
@ -16,9 +18,9 @@ let reg = /\d/;
alert( str.match(reg) ); // 7
```
The regexp is not global in the example above, so it only looks for the first match.
Without the flag `g`, the regular expression only looks for the first match, that is the first digit `\d`.
Let's add the `g` flag to look for all digits:
Let's add the `g` flag to find all digits:
```js run
let str = "+7(903)-123-45-67";
@ -26,9 +28,9 @@ let str = "+7(903)-123-45-67";
let reg = /\d/g;
alert( str.match(reg) ); // array of matches: 7,9,0,3,1,2,3,4,5,6,7
```
## Most used classes: \d \s \w
alert( str.match(reg).join('') ); // 79031234567
```
That was a character class for digits. There are other character classes as well.
@ -43,9 +45,9 @@ Most used are:
`\w` ("w" is from "word")
: A "wordly" character: either a letter of English alphabet or a digit or an underscore. Non-english letters (like cyrillic or hindi) do not belong to `\w`.
For instance, `pattern:\d\s\w` means a digit followed by a space character followed by a wordly character, like `"1 Z"`.
For instance, `pattern:\d\s\w` means a "digit" followed by a "space character" followed by a "wordly character", like `"1 a"`.
A regexp may contain both regular symbols and character classes.
**A regexp may contain both regular symbols and character classes.**
For instance, `pattern:CSS\d` matches a string `match:CSS` with a digit after it:
@ -68,7 +70,7 @@ The match (each character class corresponds to one result character):
## Word boundary: \b
The word boundary `pattern:\b` -- is a special character class.
A word boundary `pattern:\b` -- is a special character class.
It does not denote a character, but rather a boundary between characters.
@ -79,32 +81,39 @@ alert( "Hello, Java!".match(/\bJava\b/) ); // Java
alert( "Hello, JavaScript!".match(/\bJava\b/) ); // null
```
The boundary has "zero width" in a sense that usually a character class means a character in the result (like a wordly or a digit), but not in this case.
The boundary has "zero width" in a sense that usually a character class means a character in the result (like a wordly character or a digit), but not in this case.
The boundary is a test.
When regular expression engine is doing the search, it's moving along the string in an attempt to find the match. At each string position it tries to find the pattern.
When the pattern contains `pattern:\b`, it tests that the position in string fits one of the conditions:
When the pattern contains `pattern:\b`, it tests that the position in string is a word boundary, that is one of three variants:
- String start, and the first string character is `\w`.
- String end, and the last string character is `\w`.
- Inside the string: from one side is `\w`, from the other side -- not `\w`.
- Immediately before is `\w`, and immediately after -- not `\w`, or vice versa.
- At string start, and the first string character is `\w`.
- At string end, and the last string character is `\w`.
For instance, in the string `subject:Hello, Java!` the following positions match `\b`:
![](hello-java-boundaries.png)
So it matches `pattern:\bHello\b` and `pattern:\bJava\b`, but not `pattern:\bHell\b` (because there's no word boundary after `l`) and not `Java!\b` (because the exclamation sign is not a wordly character, so there's no word boundary after it).
So it matches `pattern:\bHello\b`, because:
1. At the beginning of the string the first `\b` test matches.
2. Then the word `Hello` matches.
3. Then `\b` matches, as we're between `o` and a space.
Pattern `pattern:\bJava\b` also matches. But not `pattern:\bHell\b` (because there's no word boundary after `l`) and not `Java!\b` (because the exclamation sign is not a wordly character, so there's no word boundary after it).
```js run
alert( "Hello, Java!".match(/\bHello\b/) ); // Hello
alert( "Hello, Java!".match(/\bJava\b/) ); // Java
alert( "Hello, Java!".match(/\bHell\b/) ); // null
alert( "Hello, Java!".match(/\bJava!\b/) ); // null
alert( "Hello, Java!".match(/\bHell\b/) ); // null (no match)
alert( "Hello, Java!".match(/\bJava!\b/) ); // null (no match)
```
Once again let's note that `pattern:\b` makes the searching engine to test for the boundary, so that `pattern:Java\b` finds `match:Java` only when followed by a word boundary, but it does not add a letter to the result.
Once again let's note that `pattern:\b` makes the searching engine to test for the boundary, so that `pattern:Java\b` finds `match:Java` only when followed by a word boundary, but it does not add a letter to the result.
Usually we use `\b` to find standalone English words. So that if we want `"Java"` language then `pattern:\bJava\b` finds exactly a standalone word and ignores it when it's a part of `"JavaScript"`.
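The same check is also handy for digits, because digits belong to `\w` too. As a sketch with a made-up sample string, here we look for a standalone time in the form `hh:mm`:

```js
let str = "Breakfast at 09:00 in the room 123:456.";

// two digits, a colon, two digits, with word boundaries on both sides
let times = str.match(/\b\d\d:\d\d\b/g);

console.log( times ); // 09:00
```

The part `23:45` inside `123:456` is not taken, because there's no word boundary between the digits `1` and `2`.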
@ -119,9 +128,9 @@ The word boundary check `\b` tests for a boundary between `\w` and something els
```
## Reverse classes
## Inverse classes
For every character class there exists a "reverse class", denoted with the same letter, but uppercased.
For every character class there exists an "inverse class", denoted with the same letter, but uppercased.
The "reverse" means that it matches all other characters, for instance:
@ -137,7 +146,9 @@ The "reverse" means that it matches all other characters, for instance:
`\B`
: Non-boundary: a test reverse to `\b`.
In the beginning of the chapter we saw how to get all digits from the phone `subject:+7(903)-123-45-67`. Let's get a "pure" phone number from the string:
In the beginning of the chapter we saw how to get all digits from the phone `subject:+7(903)-123-45-67`.
One way was to match all digits and join them:
```js run
let str = "+7(903)-123-45-67";
@ -145,7 +156,7 @@ let str = "+7(903)-123-45-67";
alert( str.match(/\d/g).join('') ); // 79031234567
```
An alternative way would be to find non-digits and remove them from the string:
An alternative, shorter way is to find non-digits `\D` and remove them from the string:
```js run
@ -156,11 +167,9 @@ alert( str.replace(/\D/g, "") ); // 79031234567
## Spaces are regular characters
Please note that regular expressions may include spaces. They are treated like regular characters.
Usually we pay little attention to spaces. For us strings `subject:1-5` and `subject:1 - 5` are nearly identical.
But if a regexp does not take spaces into account, it won't work.
But if a regexp doesn't take spaces into account, it may fail to work.
Let's try to find digits separated by a dash:
@ -168,23 +177,25 @@ Let's try to find digits separated by a dash:
alert( "1 - 5".match(/\d-\d/) ); // null, no match!
```
Here we fix it by adding spaces into the regexp:
Here we fix it by adding spaces into the regexp `pattern:\d - \d`:
```js run
alert( "1 - 5".match(/\d - \d/) ); // 1 - 5, now it works
```
Of course, spaces are needed only if we look for them. Extra spaces (just like any other extra characters) may prevent a match:
**A space is a character. Equal in importance with any other character.**
Of course, spaces in a regexp are needed only if we look for them. Extra spaces (just like any other extra characters) may prevent a match:
```js run
alert( "1-5".match(/\d - \d/) ); // null, because the string 1-5 has no spaces
```
In other words, in a regular expression all characters matter. Spaces too.
In other words, in a regular expression all characters matter, spaces too.
## A dot is any character
The dot `"."` is a special character class that matches *any character except a newline*.
The dot `"."` is a special character class that matches "any character except a newline".
For instance:
@ -208,19 +219,47 @@ Please note that the dot means "any character", but not the "absense of a charac
alert( "CS4".match(/CS.4/) ); // null, no match because there's no character for the dot
```
### The dotall "s" flag
Usually a dot doesn't match a newline character.
For instance, this doesn't match:
```js run
alert( "A\nB".match(/A.B/) ); // null (no match)
// a space character would match
// or a letter, but not \n
```
Sometimes it's inconvenient, we really want "any character", newline included.
That's what the `s` flag does. If a regexp has it, then the dot `"."` matches literally any character:
```js run
alert( "A\nB".match(/A.B/s) ); // A\nB (match!)
```
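If the `s` flag isn't available in the target environment, a commonly used alternative is the set `pattern:[\s\S]` ("a space or a non-space", that is: anything). Here's a small sketch that compares the three variants; it uses `console.log` instead of `alert` so that it also runs outside the browser, and the variable names are only for illustration:

```js
const text = "A\nB";

// By default the dot does not match a newline
const plainDot = text.match(/A.B/);

// With the "s" flag the dot matches the newline too
const dotAll = text.match(/A.B/s);

// [\s\S] matches any character, no flag needed
const anyChar = text.match(/A[\s\S]B/);

console.log(plainDot);            // null
console.log(dotAll[0] === text);  // true
console.log(anyChar[0] === text); // true
```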
## Summary
We covered character classes:
There exist the following character classes:
- `\d` -- digits.
- `\D` -- non-digits.
- `\s` -- space symbols, tabs, newlines.
- `\S` -- all but `\s`.
- `\w` -- English letters, digits, underscore `'_'`.
- `\W` -- all but `\w`.
- `'.'` -- any character except a newline.
- `pattern:\d` -- digits.
- `pattern:\D` -- non-digits.
- `pattern:\s` -- space symbols, tabs, newlines.
- `pattern:\S` -- all but `pattern:\s`.
- `pattern:\w` -- English letters, digits, underscore `'_'`.
- `pattern:\W` -- all but `pattern:\w`.
- `pattern:.` -- any character if the regexp has the `'s'` flag, otherwise any character except a newline.
If we want to search for a character that has a special meaning like a backslash or a dot, then we should escape it with a backslash: `pattern:\.`
...But that's not all!
Please note that a regexp may also contain string special characters such as a newline `\n`. There's no conflict with character classes, because other letters are used for them.
Modern JavaScript also allows us to look for characters by their Unicode properties, for instance:
- A cyrillic letter is: `pattern:\p{Script=Cyrillic}` or `pattern:\p{sc=Cyrillic}`.
- A dash (be it a small hyphen `-` or a long dash `—`): `pattern:\p{Dash_Punctuation}` or `pattern:\p{Pd}`.
- A currency symbol: `pattern:\p{Currency_Symbol}` or `pattern:\p{Sc}`.
- ...And much more. Unicode has a lot of character categories that we can select from.
These patterns require the `'u'` regexp flag to work. More about that in the chapter [](info:regexp-unicode).
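To see Unicode properties in action, here's a minimal sketch. The sample strings are made up for illustration, and `console.log` is used so the code also runs outside the browser. Please note that the short aliases are case-sensitive, so the dash and currency categories are written as `Pd` and `Sc`:

```js
// Cyrillic letters, using the short alias "sc" for "Script"
const cyrillic = "Привет, world".match(/\p{sc=Cyrillic}/gu);

// Dashes: a hyphen "-" and a long dash (written as \u2014 here)
const dashes = "a-b\u2014c".match(/\p{Pd}/gu);

// Currency symbols
const currency = "$10, €20, ¥30".match(/\p{Sc}/gu);

console.log(cyrillic.join("")); // Привет
console.log(dashes.length);     // 2
console.log(currency.join("")); // $€¥
```

Without the `u` flag, `\p` has no special meaning and the patterns above won't find these characters.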

View file

@ -1,24 +1,24 @@
# Escaping, special characters
As we've seen, a backslash `"\"` is used to denote character classes. So it's a special character.
As we've seen, a backslash `"\"` is used to denote character classes. So it's a special character in regexps (just like in a regular string).
There are other special characters as well, that have special meaning in a regexp. They are used to do more powerful searches.
There are other special characters as well, that have special meaning in a regexp. They are used to do more powerful searches. Here's a full list of them: `pattern:[ \ ^ $ . | ? * + ( )`.
Here's a full list of them: `pattern:[ \ ^ $ . | ? * + ( )`.
Don't try to remember it -- when we deal with each of them separately, you'll know it by heart automatically.
Don't try to remember the list -- soon we'll deal with each of them separately and you'll know them by heart automatically.
## Escaping
To use a special character as a regular one, prepend it with a backslash.
Let's say we want to find a dot literally. Not "any character", but just a dot.
To use a special character as a regular one, prepend it with a backslash: `pattern:\.`.
That's also called "escaping a character".
For instance, we need to find a dot `pattern:'.'`. In a regular expression a dot means "any character except a newline", so if we really mean "a dot", let's put a backslash before it: `pattern:\.`.
For example:
```js run
alert( "Chapter 5.1".match(/\d\.\d/) ); // 5.1
alert( "Chapter 5.1".match(/\d\.\d/) ); // 5.1 (match!)
alert( "Chapter 511".match(/\d\.\d/) ); // null (looking for a real dot \.)
```
Parentheses are also special characters, so if we want them, we should use `pattern:\(`. The example below looks for a string `"g()"`:
@ -27,7 +27,7 @@ Parentheses are also special characters, so if we want them, we should use `patt
alert( "function g()".match(/g\(\)/) ); // "g()"
```
If we're looking for a backslash `\`, then we should double it:
If we're looking for a backslash `\`, it's a special character in both regular strings and regexps, so we should double it.
```js run
alert( "1\\2".match(/\\/) ); // '\'
@ -35,7 +35,7 @@ alert( "1\\2".match(/\\/) ); // '\'
## A slash
The slash symbol `'/'` is not a special character, but in JavaScript it is used to open and close the regexp: `pattern:/...pattern.../`, so we should escape it too.
A slash symbol `'/'` is not a special character, but in JavaScript it is used to open and close the regexp: `pattern:/...pattern.../`, so we should escape it too.
Here's what a search for a slash `'/'` looks like:
@ -43,7 +43,7 @@ Here's what a search for a slash `'/'` looks like:
alert( "/".match(/\//) ); // '/'
```
On the other hand, the alternative `new RegExp` syntax does not require escaping it:
On the other hand, if we're not using `/.../`, but create a regexp using `new RegExp`, then we don't need to escape it:
```js run
alert( "/".match(new RegExp("/")) ); // '/'
@ -51,7 +51,7 @@ alert( "/".match(new RegExp("/")) ); // '/'
## new RegExp
If we are creating a regular expression with `new RegExp`, then we need to do some more escaping.
If we are creating a regular expression with `new RegExp`, then we don't have to escape `/`, but need to do some other escaping.
For instance, consider this:
@ -61,21 +61,23 @@ let reg = new RegExp("\d\.\d");
alert( "Chapter 5.1".match(reg) ); // null
```
It doesn't work, but why?
It worked with `pattern:/\d\.\d/`, but with `new RegExp("\d\.\d")` it doesn't, why?
The reason is string escaping rules. Look here:
The reason is that backslashes are "consumed" by a string. Remember, regular strings have their own special characters like `\n`, and a backslash is used for escaping.
Let's take a look at what `"\d\.\d"` really is:
```js run
alert("\d\.\d"); // d.d
```
Backslashes are used for escaping inside a string and string-specific special characters like `\n`. The quotes "consume" and interpret them, for instance:
The quotes "consume" backslashes and interpret them, for instance:
- `\n` -- becomes a newline character,
- `\u1234` -- becomes the Unicode character with such code,
- ...And when there's no special meaning: like `\d` or `\z`, then the backslash is simply removed.
So the call to `new RegExp` gets a string without backslashes.
So the call to `new RegExp` gets a string without backslashes. That's why it doesn't work!
To fix it, we need to double backslashes, because quotes turn `\\` into `\`:
@ -89,3 +91,9 @@ let reg = new RegExp(regStr);
alert( "Chapter 5.1".match(reg) ); // 5.1
```
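As an alternative to doubling backslashes, we could build the string with the standard `String.raw` tag, which keeps backslashes untouched. This is only a sketch of that option, not something the chapter relies on; `console.log` is used so it runs outside the browser too:

```js
// String.raw keeps backslashes as they are, so no doubling is needed
const regStr = String.raw`\d\.\d`;
const reg = new RegExp(regStr);

console.log(regStr);                      // \d\.\d
console.log("Chapter 5.1".match(reg)[0]); // 5.1
console.log("Chapter 511".match(reg));    // null
```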
## Summary
- To search special characters `pattern:[ \ ^ $ . | ? * + ( )` literally, we need to prepend them with `\` ("escape them").
- We also need to escape `/` if we're inside `pattern:/.../` (but not inside `new RegExp`).
- When passing a string to `new RegExp`, we need to double backslashes `\\`, because strings consume one of them.
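These rules matter most when the text to search for comes from outside, for example from user input. The sketch below uses a hypothetical helper `escapeRegExp` (our own name, not a built-in) that prepends a backslash to every special character, so the resulting string can be safely passed to `new RegExp`:

```js
// Hypothetical helper: escapes regexp special characters (and the slash) in a string
function escapeRegExp(str) {
  // "\\$&" in the replacement means: a backslash, then the matched character
  return str.replace(/[.*+?^${}()|[\]\\\/]/g, "\\$&");
}

const userInput = "5.1 (draft)";
const reg = new RegExp(escapeRegExp(userInput));

console.log(escapeRegExp(userInput));         // 5\.1 \(draft\)
console.log(reg.test("Chapter 5.1 (draft)")); // true
console.log(reg.test("Chapter 511 (draft)")); // false
```

Without escaping, the dot in `5.1` would match any character and the parentheses would form a group, so `511 draft` would match as well.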

View file

@ -1,69 +0,0 @@
# The unicode flag
The unicode flag `/.../u` enables the correct support of surrogate pairs.
Surrogate pairs are explained in the chapter <info:string>.
Let's briefly recall them here. In short, normally characters are encoded with 2 bytes. That gives us 65536 characters maximum. But there are more characters in the world.
So certain rare characters are encoded with 4 bytes, like `𝒳` (mathematical X) or `😄` (a smile).
Here are the unicode values to compare:
| Character | Unicode | Bytes |
|------------|---------|--------|
| `a` | 0x0061 | 2 |
| `≈` | 0x2248 | 2 |
|`𝒳`| 0x1d4b3 | 4 |
|`𝒴`| 0x1d4b4 | 4 |
|`😄`| 0x1f604 | 4 |
So characters like `a` and `≈` occupy 2 bytes, and those rare ones take 4.
The unicode is made in such a way that the 4-byte characters only have a meaning as a whole.
In the past JavaScript did not know about that, and many string methods still have problems. For instance, `length` thinks that here are two characters:
```js run
alert('😄'.length); // 2
alert('𝒳'.length); // 2
```
...But we can see that there's only one, right? The point is that `length` treats 4 bytes as two 2-byte characters. That's incorrect, because they must be considered only together (so-called "surrogate pair").
Normally, regular expressions also treat "long characters" as two 2-byte ones.
That leads to odd results, for instance let's try to find `pattern:[𝒳𝒴]` in the string `subject:𝒳`:
```js run
alert( '𝒳'.match(/[𝒳𝒴]/) ); // odd result
```
The result would be wrong, because by default the regexp engine does not understand surrogate pairs. It thinks that `[𝒳𝒴]` are not two, but four characters: the left half of `𝒳` `(1)`, the right half of `𝒳` `(2)`, the left half of `𝒴` `(3)`, the right half of `𝒴` `(4)`.
So it finds the left half of `𝒳` in the string `𝒳`, not the whole symbol.
In other words, the search works like `'12'.match(/[1234]/)` -- the `1` is returned (left half of `𝒳`).
The `/.../u` flag fixes that. It enables surrogate pairs in the regexp engine, so the result is correct:
```js run
alert( '𝒳'.match(/[𝒳𝒴]/u) ); // 𝒳
```
There's an error that may happen if we forget the flag:
```js run
'𝒳'.match(/[𝒳-𝒴]/); // SyntaxError: invalid range in character class
```
Here the regexp `[𝒳-𝒴]` is treated as `[12-34]` (where `2` is the right part of `𝒳` and `3` is the left part of `𝒴`), and the range between two halves `2` and `3` is unacceptable.
Using the flag would make it work right:
```js run
alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴
```
To finalize, let's note that if we do not deal with surrogate pairs, then the flag does nothing for us. But in the modern world we often meet them.

View file

@ -1,16 +1,18 @@
# Quantifiers +, *, ? and {n}
Let's say we have a string like `+7(903)-123-45-67` and want to find all numbers in it. But unlike before, we are interested in not digits, but full numbers: `7, 903, 123, 45, 67`.
Let's say we have a string like `+7(903)-123-45-67` and want to find all numbers in it. But unlike before, we are interested not in single digits, but full numbers: `7, 903, 123, 45, 67`.
A number is a sequence of 1 or more digits `\d`. The instrument to say how many we need is called *quantifiers*.
A number is a sequence of 1 or more digits `\d`. To mark how many we need, we can append a *quantifier*.
## Quantity {n}
The most obvious quantifier is a number in figure quotes: `pattern:{n}`. A quantifier is put after a character (or a character class and so on) and specifies exactly how many we need.
The simplest quantifier is a number in curly braces: `pattern:{n}`.
It also has advanced forms, here we go with examples:
A quantifier is appended to a character (or a character class, or a `[...]` set etc) and specifies how many we need.
Exact count: `{5}`
It has a few advanced forms, let's see examples:
The exact count: `{5}`
: `pattern:\d{5}` denotes exactly 5 digits, the same as `pattern:\d\d\d\d\d`.
The example below looks for a 5-digit number:
@ -21,20 +23,24 @@ Exact count: `{5}`
We can add `\b` to exclude longer numbers: `pattern:\b\d{5}\b`.
The count from-to: `{3,5}`
: To find numbers from 3 to 5 digits we can put the limits into figure brackets: `pattern:\d{3,5}`
The range: `{3,5}`, match 3-5 times
: To find numbers from 3 to 5 digits we can put the limits into curly braces: `pattern:\d{3,5}`
```js run
alert( "I'm not 12, but 1234 years old".match(/\d{3,5}/) ); // "1234"
```
We can omit the upper limit. Then a regexp `pattern:\d{3,}` looks for numbers of `3` and more digits:
We can omit the upper limit.
Then a regexp `pattern:\d{3,}` looks for sequences of digits of length `3` or more:
```js run
alert( "I'm not 12, but 345678 years old".match(/\d{3,}/) ); // "345678"
```
In case with the string `+7(903)-123-45-67` we need numbers: one or more digits in a row. That is `pattern:\d{1,}`:
Let's return to the string `+7(903)-123-45-67`.
A number is a sequence of one or more digits in a row. So the regexp is `pattern:\d{1,}`:
```js run
let str = "+7(903)-123-45-67";
@ -46,7 +52,7 @@ alert(numbers); // 7,903,123,45,67
## Shorthands
Most often needed quantifiers have shorthands:
There are shorthands for the most used quantifiers:
`+`
: Means "one or more", the same as `{1,}`.
@ -64,7 +70,7 @@ Most often needed quantifiers have shorthands:
For instance, the pattern `pattern:ou?r` looks for `match:o` followed by zero or one `match:u`, and then `match:r`.
So it can find `match:or` in the word `subject:color` and `match:our` in `subject:colour`:
So, `pattern:colou?r` finds both `match:color` and `match:colour`:
```js run
let str = "Should I write color or colour?";
@ -75,7 +81,7 @@ Most often needed quantifiers have shorthands:
`*`
: Means "zero or more", the same as `{0,}`. That is, the character may repeat any number of times or be absent.
The example below looks for a digit followed by any number of zeroes:
For example, `pattern:\d0*` looks for a digit followed by any number of zeroes:
```js run
alert( "100 10 1".match(/\d0*/g) ); // 100, 10, 1
@ -85,11 +91,12 @@ Most often needed quantifiers have shorthands:
```js run
alert( "100 10 1".match(/\d0+/g) ); // 100, 10
// 1 not matched, as 0+ requires at least one zero
```
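To double-check that the shorthands really are the same as their curly-brace forms, we can run both variants side by side. A small sketch with illustrative variable names (`console.log` is used so it runs outside the browser too):

```js
const str = "100 10 1";

// "+" is the same as {1,}
const plusShort = str.match(/\d0+/g);
const plusBraces = str.match(/\d0{1,}/g);
console.log(plusShort, plusBraces); // both: 100, 10

// "*" is the same as {0,}
const starShort = str.match(/\d0*/g);
const starBraces = str.match(/\d0{0,}/g);
console.log(starShort, starBraces); // both: 100, 10, 1

// "?" is the same as {0,1}
const text = "Should I write color or colour?";
const optShort = text.match(/colou?r/g);
const optBraces = text.match(/colou{0,1}r/g);
console.log(optShort, optBraces); // both: color, colour
```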
## More examples
Quantifiers are used very often. They are one of the main "building blocks" for complex regular expressions, so let's see more examples.
Quantifiers are used very often. They serve as the main "building block" of complex regular expressions, so let's see more examples.
Regexp "decimal fraction" (a number with a floating point): `pattern:\d+\.\d+`
: In action:
@ -120,12 +127,12 @@ Regexp "opening or closing HTML-tag without attributes": `pattern:/<\/?[a-z][a-z
alert( "<h1>Hi!</h1>".match(/<\/?[a-z][a-z0-9]*>/gi) ); // <h1>, </h1>
```
```smart header="More precise means more complex"
```smart header="To make a regexp more precise, we often need to make it more complex"
We can see one common rule in these examples: the more precise is the regular expression -- the longer and more complex it is.
For instance, HTML tags could use a simpler regexp: `pattern:<\w+>`.
For instance, for HTML tags we could use a simpler regexp: `pattern:<\w+>`.
Because `pattern:\w` means any English letter or a digit or `'_'`, the regexp also matches non-tags, for instance `match:<_>`. But it's much simpler than `pattern:<[a-z][a-z0-9]*>`.
...But because `pattern:\w` means any English letter or a digit or `'_'`, the regexp also matches non-tags, for instance `match:<_>`. So it's much simpler than `pattern:<[a-z][a-z0-9]*>`, but less reliable.
Are we ok with `pattern:<\w+>` or do we need `pattern:<[a-z][a-z0-9]*>`?

View file

@ -8,15 +8,13 @@ Let's take the following task as an example.
We have a text and need to replace all quotes `"..."` with guillemet marks: `«...»`. They are preferred for typography in many countries.
For instance: `"Hello, world"` should become `«Hello, world»`.
For instance: `"Hello, world"` should become `«Hello, world»`. Some countries prefer other quotes, like `„Witam, świat!”` (Polish) or `「你好,世界」` (Chinese), but for our task let's choose `«...»`.
Some countries prefer `„Witam, świat!”` (Polish) or even `「你好,世界」` (Chinese) quotes. For different locales we can choose different replacements, but that all works the same, so let's start with `«...»`.
The first thing to do is to locate quoted strings, and then we can replace them.
To make replacements we first need to find all quoted substrings.
A regular expression like `pattern:/".+"/g` (a quote, then something, then the other quote) may seem like a good fit, but it isn't!
The regular expression can look like this: `pattern:/".+"/g`. That is: we look for a quote followed by one or more characters, and then another quote.
...But if we try to apply it, even in such a simple case...
Let's try it:
```js run
let reg = /".+"/g;
@ -193,7 +191,7 @@ Please note, that this logic does not replace lazy quantifiers!
It is just different. There are times when we need one or another.
Let's see one more example where lazy quantifiers fail and this variant works right.
**Let's see an example where lazy quantifiers fail and this variant works right.**
For instance, we want to find links of the form `<a href="..." class="doc">`, with any `href`.
@ -210,7 +208,7 @@ let reg = /<a href=".*" class="doc">/g;
alert( str.match(reg) ); // <a href="link" class="doc">
```
...But what if there are many links in the text?
It worked. But let's see what happens if there are many links in the text?
```js run
let str = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';
@ -239,14 +237,14 @@ let reg = /<a href=".*?" class="doc">/g;
alert( str.match(reg) ); // <a href="link1" class="doc">, <a href="link2" class="doc">
```
Now it works, there are two matches:
Now it seems to work, there are two matches:
```html
<a href="....." class="doc"> <a href="....." class="doc">
<a href="link1" class="doc">... <a href="link2" class="doc">
```
Why it works -- should be obvious after all explanations above. So let's not stop on the details, but try one more text:
...But let's test it on one more text input:
```js run
let str = '...<a href="link1" class="wrong">... <p style="" class="doc">...';
@ -256,24 +254,24 @@ let reg = /<a href=".*?" class="doc">/g;
alert( str.match(reg) ); // <a href="link1" class="wrong">... <p style="" class="doc">
```
We can see that the regexp matched not just a link, but also a lot of text after it, including `<p...>`.
Now it fails. The match includes not just a link, but also a lot of text after it, including `<p...>`.
Why it happens?
Why?
That's what's going on:
1. First the regexp finds a link start `match:<a href="`.
2. Then it looks for `pattern:.*?`: takes one character (lazily!), check if there's a match for `pattern:" class="doc">` (none).
3. Then takes another character into `pattern:.*?`, and so on... until it finally reaches `match:" class="doc">`.
2. Then it looks for `pattern:.*?`: takes one character, then checks if there's a match for the rest of the pattern, then takes another one...
But the problem is: that's already beyond the link, in another tag `<p>`. Not what we want.
The quantifier `pattern:.*?` consumes characters until it meets `match:class="doc">`.
Here's the picture of the match aligned with the text:
...And where can it find it? If we look at the text, then we can see that the only `match:class="doc">` is beyond the link, in the tag `<p>`.
3. So we have a match:
```html
<a href="..................................." class="doc">
<a href="link1" class="wrong">... <p style="" class="doc">
```
```html
<a href="..................................." class="doc">
<a href="link1" class="wrong">... <p style="" class="doc">
```
So the laziness did not work for us here.
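One way to keep the match inside a single `href` is to forbid quotes in the value, using `pattern:[^"]*` instead of `pattern:.*?`. The sketch below assumes that `href` values never contain a double quote, and uses `console.log` so it runs outside the browser too:

```js
// [^"]* means: any characters except a quote, so it can't run past the closing quote of href
const reg = /<a href="[^"]*" class="doc">/g;

const str1 = '...<a href="link1" class="wrong">... <p style="" class="doc">...';
const str2 = '...<a href="link1" class="doc">... <a href="link2" class="doc">...';

console.log(str1.match(reg)); // null, no false match any more
console.log(str2.match(reg)); // <a href="link1" class="doc">, <a href="link2" class="doc">
```

Here the value can't "jump" over the closing quote, so the rest of the pattern has to match right after it, inside the same tag.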

View file

@ -4,7 +4,7 @@ A part of a pattern can be enclosed in parentheses `pattern:(...)`. This is call
That has two effects:
1. It allows to place a part of the match into a separate array item when using [String#match](mdn:js/String/match) or [RegExp#exec](mdn:/RegExp/exec) methods.
1. It allows us to get a part of the match as a separate item in the result array.
2. If we put a quantifier after the parentheses, it applies to the parentheses as a whole, not the last character.
## Example
@ -30,32 +30,30 @@ john.smith@site.com.uk
The pattern: `pattern:[-.\w]+@([\w-]+\.)+[\w-]{2,20}`.
- The first part before `@` may include any alphanumeric word characters, a dot and a dash `pattern:[-.\w]+`, like `match:john.smith`.
- Then `pattern:@`
- And then the domain and maybe a second-level domain like `site.com` or with subdomains like `host.site.com.uk`. We can match it as "a word followed by a dot" repeated one or more times for subdomains: `match:mail.` or `match:site.com.`, and then "a word" for the last part: `match:.com` or `match:.uk`.
1. The first part `pattern:[-.\w]+` (before `@`) may include any alphanumeric word characters, a dot and a dash, to match `match:john.smith`.
2. Then `pattern:@`, and the domain. It may be a subdomain like `host.site.com.uk`, so we match it as "a word followed by a dot" `pattern:([\w-]+\.)` (repeated), and then the last part must be a word: `match:com` or `match:uk` (but not very long: 2-20 characters).
The word followed by a dot is `pattern:(\w+\.)+` (repeated). The last word should not have a dot at the end, so it's just `\w{2,20}`. The quantifier `pattern:{2,20}` limits the length, because domain zones are like `.uk` or `.com` or `.museum`, but can't be longer than 20 characters.
That regexp is not perfect, but good enough to fix errors or occasional mistypes.
So the domain pattern is `pattern:(\w+\.)+\w{2,20}`. Now we replace `\w` with `[\w-]`, because dashes are also allowed in domains, and we get the final result.
That regexp is not perfect, but usually works. It's short and good enough to fix errors or occasional mistypes.
For instance, here we can find all emails in the string:
For instance, we can find all emails in the string:
```js run
let reg = /[-.\w]+@([\w-]+\.)+[\w-]{2,20}/g;
alert("my@mail.com @ his@site.com.uk".match(reg)); // my@mail.com,his@site.com.uk
alert("my@mail.com @ his@site.com.uk".match(reg)); // my@mail.com, his@site.com.uk
```
In this example parentheses were used to make a group for repeating `pattern:(...)+`. But there are other uses too, let's see them.
## Contents of parentheses
Parentheses are numbered from left to right. The search engine remembers the content of each and allows to reference it in the pattern or in the replacement string.
For instance, we can find an HTML-tag using a (simplified) pattern `pattern:<.*?>`. Usually we'd want to do something with the result after it.
For instance, we'd like to find HTML tags `pattern:<.*?>`, and process them.
If we enclose the inner contents of `<...>` into parentheses, then we can access it like this:
Let's wrap the inner content into parentheses, like this: `pattern:<(.*?)>`.
We'll get them into an array:
```js run
let str = '<h1>Hello, world!</h1>';
@ -66,7 +64,7 @@ alert( str.match(reg) ); // Array: ["<h1>", "h1"]
The call to [String#match](mdn:js/String/match) returns groups only if the regexp has no `pattern:/.../g` flag.
If we need all matches with their groups then we can use [RegExp#exec](mdn:js/RegExp/exec) method as described in <info:regexp-methods>:
If we need all matches with their groups then we can use `.matchAll` or `regexp.exec` as described in <info:regexp-methods>:
```js run
let str = '<h1>Hello, world!</h1>';
@ -74,13 +72,10 @@ let str = '<h1>Hello, world!</h1>';
// two matches: opening <h1> and closing </h1> tags
let reg = /<(.*?)>/g;
let match;
let matches = Array.from( str.matchAll(reg) );
while (match = reg.exec(str)) {
// first shows the match: <h1>,h1
// then shows the match: </h1>,/h1
alert(match);
}
alert(matches[0]); // Array: ["<h1>", "h1"]
alert(matches[1]); // Array: ["</h1>", "/h1"]
```
Here we have two matches for `pattern:<(.*?)>`, each of them is an array with the full match and groups.
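Since every match is an array, we can also loop over the result of `matchAll` and destructure the full match and the group right away. A short sketch (the names `full`, `inner` and `tags` are chosen for illustration; `console.log` is used so it runs outside the browser too):

```js
const str = '<h1>Hello, world!</h1>';
const reg = /<(.*?)>/g;

const tags = [];

// each item is an array: [full match, group 1]
for (const [full, inner] of str.matchAll(reg)) {
  tags.push(`${full} -> ${inner}`);
}

console.log(tags); // <h1> -> h1, </h1> -> /h1
```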
@ -146,13 +141,78 @@ alert( match[2] ); // c
The array length is permanent: `3`. But there's nothing for the group `pattern:(z)?`, so the result is `["ac", undefined, "c"]`.
## Named groups
Remembering groups by their numbers is hard. For simple patterns it's doable, but for more complex ones we can give names to parentheses.
That's done by putting `pattern:?<name>` immediately after the opening paren, like this:
```js run
*!*
let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;
*/!*
let str = "2019-04-30";
let groups = str.match(dateRegexp).groups;
alert(groups.year); // 2019
alert(groups.month); // 04
alert(groups.day); // 30
```
As you can see, the groups reside in the `.groups` property of the match.
We can also use them in replacements, as `pattern:$<name>` (like `$1..9`, but name instead of a digit).
For instance, let's rearrange the date into `day.month.year`:
```js run
let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;
let str = "2019-04-30";
let rearranged = str.replace(dateRegexp, '$<day>.$<month>.$<year>');
alert(rearranged); // 30.04.2019
```
If we use a function, then the named `groups` object is always the last argument:
```js run
let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;
let str = "2019-04-30";
let rearranged = str.replace(dateRegexp,
(str, year, month, day, offset, input, groups) =>
`${groups.day}.${groups.month}.${groups.year}`
);
alert(rearranged); // 30.04.2019
```
Usually, when we intend to use named groups, we don't need positional arguments of the function. For the majority of real-life cases we only need `str` and `groups`.
So we can write it a little bit shorter:
```js
let rearranged = str.replace(dateRegexp, (str, ...args) => {
  let {year, month, day} = args.pop();
  alert(str); // 2019-04-30
  alert(year); // 2019
  alert(month); // 04
  alert(day); // 30
  return `${day}.${month}.${year}`; // without a return the match would be replaced with "undefined"
});
```
## Non-capturing groups with ?:
Sometimes we need parentheses to correctly apply a quantifier, but we don't want their contents in the array.
Sometimes we need parentheses to correctly apply a quantifier, but we don't want the contents in results.
A group may be excluded by adding `pattern:?:` in the beginning.
For instance, if we want to find `pattern:(go)+`, but don't want to put remember the contents (`go`) in a separate array item, we can write: `pattern:(?:go)+`.
For instance, if we want to find `pattern:(go)+`, but don't want to remember the contents (`go`) in a separate array item, we can write: `pattern:(?:go)+`.
In the example below we only get the name "John" as a separate member of the `results` array:
@ -168,3 +228,10 @@ let result = str.match(reg);
alert( result.length ); // 2
alert( result[1] ); // John
```
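To see the difference, let's run the same search with a capturing and a non-capturing group. This sketch assumes the sample string `"Gogogo John!"` and uses `console.log` so it runs outside the browser too:

```js
const str = "Gogogo John!";

// (go)+ is remembered as group 1, the name becomes group 2
const capturing = str.match(/(go)+ (\w+)/i);

// (?:go)+ is not remembered, so the name is group 1
const nonCapturing = str.match(/(?:go)+ (\w+)/i);

console.log(capturing.length);    // 3
console.log(capturing[2]);        // John
console.log(nonCapturing.length); // 2
console.log(nonCapturing[1]);     // John
```

Both patterns find the same text; the only difference is which parts get their own items in the result.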
## Summary
- Parentheses can be:
- capturing `(...)`, ordered left-to-right, accessible by number.
- named capturing `(?<name>...)`, accessible by name.
- non-capturing `(?:...)`, used only to apply a quantifier to the whole group.

View file

@ -1,36 +1,21 @@
# Backreferences: \n and $n
# Backreferences in pattern: \n and \k
Capturing groups may be accessed not only in the result, but in the replacement string, and in the pattern too.
Capturing groups can be accessed not only in the result or in the replacement string, but also in the pattern itself.
## Group in replacement: $n
## Backreference by number: \n
When we are using `replace` method, we can access n-th group in the replacement string using `$n`.
A group can be referenced in the pattern using `\n`, where `n` is the group number.
For instance:
To make things clear let's consider a task.
```js run
let name = "John Smith";
name = name.replace(/(\w+) (\w+)/i, *!*"$2, $1"*/!*);
alert( name ); // Smith, John
```
Here `pattern:$1` in the replacement string means "substitute the content of the first group here", and `pattern:$2` means "substitute the second group here".
Referencing a group in the replacement string allows us to reuse the existing text during the replacement.
## Group in pattern: \n
A group can be referenced in the pattern using `\n`.
To make things clear let's consider a task. We need to find a quoted string: either a single-quoted `subject:'...'` or a double-quoted `subject:"..."` -- both variants need to match.
We need to find a quoted string: either a single-quoted `subject:'...'` or a double-quoted `subject:"..."` -- both variants need to match.
How to look for them?
We can put two kinds of quotes in the pattern: `pattern:['"](.*?)['"]`. That finds strings like `match:"..."` and `match:'...'`, but it gives incorrect matches when one quote appears inside another one, like the string `subject:"She's the one!"`:
We can put two kinds of quotes in the pattern: `pattern:['"](.*?)['"]`, but it would find strings with mixed quotes, like `match:"...'` and `match:'..."`. That would lead to incorrect matches when one quote appears inside other ones, like the string `subject:"She's the one!"`:
```js run
let str = "He said: \"She's the one!\".";
let str = `He said: "She's the one!".`;
let reg = /['"](.*?)['"]/g;
@ -40,21 +25,41 @@ alert( str.match(reg) ); // "She'
As we can see, the pattern found an opening quote `match:"`, then the text is consumed lazily till the other quote `match:'`, that closes the match.
To make sure that the pattern looks for the closing quote exactly the same as the opening one, let's make a group of it and use the backreference:
To make sure that the pattern looks for the closing quote exactly the same as the opening one, we can wrap it into a capturing group and use the backreference.
Here's the correct code:
```js run
let str = "He said: \"She's the one!\".";
let str = `He said: "She's the one!".`;
*!*
let reg = /(['"])(.*?)\1/g;
*/!*
alert( str.match(reg) ); // "She's the one!"
```
Now everything's correct! The regular expression engine finds the first quote `pattern:(['"])` and remembers the content of `pattern:(...)`, that's the first capturing group.
Now it works! The regular expression engine finds the first quote `pattern:(['"])` and remembers the content of `pattern:(...)`, that's the first capturing group.
Further in the pattern `pattern:\1` means "find the same text as in the first group".
Further in the pattern `pattern:\1` means "find the same text as in the first group", exactly the same quote in our case.
Please note:
- To reference a group inside a replacement string -- we use `$1`, while in the pattern -- a backslash `\1`.
- If we use `?:` in the group, then we can't reference it. Groups that are excluded from capturing `(?:...)` are not remembered by the engine.
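Here's a small sketch that uses both forms together: `\1` inside the pattern to find a word repeated twice, and `$1` in the replacement string to keep only one copy. The sample text is made up for illustration, and `console.log` is used so it runs outside the browser too:

```js
const str = "This is is a test test.";

// \1 in the pattern: the same word again, right after a space
const doubled = /\b(\w+) \1\b/g;

console.log(str.match(doubled)); // is is, test test

// $1 in the replacement string: put the remembered word once
const fixed = str.replace(doubled, "$1");

console.log(fixed); // This is a test.
```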
## Backreference by name: `\k<name>`
For named groups, we can backreference by `\k<name>`.
The same example with the named group:
```js run
let str = `He said: "She's the one!".`;
*!*
let reg = /(?<quote>['"])(.*?)\k<quote>/g;
*/!*
alert( str.match(reg) ); // "She's the one!"
```

View file

@ -0,0 +1,156 @@
# Unicode: flag "u", character properties "\\p"
The unicode flag `/.../u` enables the correct support of surrogate pairs.
Surrogate pairs are explained in the chapter <info:string>.
Let's briefly recall them here. In short, normally characters are encoded with 2 bytes. That gives us 65536 characters maximum. But there are more characters in the world.
So certain rare characters are encoded with 4 bytes, like `𝒳` (mathematical X) or `😄` (a smile).
Here are the unicode values to compare:
| Character | Unicode | Bytes |
|------------|---------|--------|
| `a` | 0x0061 | 2 |
| `≈` | 0x2248 | 2 |
|`𝒳`| 0x1d4b3 | 4 |
|`𝒴`| 0x1d4b4 | 4 |
|`😄`| 0x1f604 | 4 |
So characters like `a` and `≈` occupy 2 bytes, and those rare ones take 4.
The unicode is made in such a way that the 4-byte characters only have a meaning as a whole.
In the past JavaScript did not know about that, and many string methods still have problems. For instance, `length` thinks that here are two characters:
```js run
alert('😄'.length); // 2
alert('𝒳'.length); // 2
```
...But we can see that there's only one, right? The point is that `length` treats 4 bytes as two 2-byte characters. That's incorrect, because they must be considered only together (so-called "surrogate pair").
Normally, regular expressions also treat "long characters" as two 2-byte ones.
That leads to odd results, for instance let's try to find `pattern:[𝒳𝒴]` in the string `subject:𝒳`:
```js run
alert( '𝒳'.match(/[𝒳𝒴]/) ); // odd result (wrong match actually, "half-character")
```
The result is wrong, because by default the regexp engine does not understand surrogate pairs.
So, it thinks that `[𝒳𝒴]` are not two, but four characters:
1. the left half of `𝒳` `(1)`,
2. the right half of `𝒳` `(2)`,
3. the left half of `𝒴` `(3)`,
4. the right half of `𝒴` `(4)`.
We can list them like this:
```js run
for (let i = 0; i < '𝒳𝒴'.length; i++) {
  alert('𝒳𝒴'.charCodeAt(i)); // 55349, 56499, 55349, 56500
}
```
So it finds only the "left half" of `𝒳`.
In other words, the search works like `'12'.match(/[1234]/)`: only `1` is returned.
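We can check that the match really is a "half-character" by looking at its length and code. A small sketch (the variable name `result` is only for illustration):

```js
let result = '𝒳'.match(/[𝒳𝒴]/);

// the match is a single 2-byte unit, not the whole character
console.log(result[0].length); // 1

// and its code is the first half of the surrogate pair
console.log(result[0].charCodeAt(0)); // 55349
```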
## The "u" flag
The `/.../u` flag fixes that.
It enables surrogate pairs in the regexp engine, so the result is correct:
```js run
alert( '𝒳'.match(/[𝒳𝒴]/u) ); // 𝒳
```
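The flag affects more than character sets. For example, the dot also starts to treat a surrogate pair as one character. A short sketch to compare both modes:

```js
let str = '😄';

// without "u" the dot matches only one half of the pair
console.log(str.match(/./)[0].length); // 1
console.log(/^.$/.test(str)); // false

// with "u" the dot matches the whole character
console.log(str.match(/./u)[0] === str); // true
console.log(/^.$/u.test(str)); // true
```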
Let's see one more example.
If we forget the `u` flag and accidentally use surrogate pairs, then we can get an error:
```js run
'𝒳'.match(/[𝒳-𝒴]/); // SyntaxError: invalid range in character class
```
Normally, regexps understand `[a-z]` as a "range of characters with codes between the codes of `a` and `z`".
But without the `u` flag, surrogate pairs are treated as a "pair of independent characters", so `[𝒳-𝒴]` is like `[<55349><56499>-<55349><56500>]` (each surrogate pair replaced with its codes). Now we can clearly see that the range `56499-55349` is unacceptable, as the left range border must not be greater than the right one.
Using the `u` flag makes it work right:
```js run
alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴
```
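When a pattern comes from a string, the same error appears at the moment the regexp is created, so it can be caught. Below is a sketch with a hypothetical helper `isValidPattern` (not a built-in) that checks a pattern with and without the flag:

```js
// Hypothetical helper: tries to build a regexp and tells whether the pattern is valid
function isValidPattern(pattern, flags) {
  try {
    new RegExp(pattern, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // not a pattern problem, rethrow
  }
}

console.log(isValidPattern('[𝒳-𝒴]', ''));  // false (invalid range without "u")
console.log(isValidPattern('[𝒳-𝒴]', 'u')); // true
```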
## Unicode character properties
[Unicode](https://en.wikipedia.org/wiki/Unicode), the encoding format used by JavaScript strings, has a lot of properties for different characters (or, technically, code points). They describe which "categories" a character belongs to, and a variety of technical details.
In regular expressions these can be matched with `\p{…}`. The `u` flag is required for that.
For instance, `\p{Letter}` denotes a letter in any language. We can also use `\p{L}`, as `L` is an alias of `Letter`. There are shorter aliases for almost every property.
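A quick sketch of `\p{L}` in action (the sample string is made up for illustration):

```js
let str = 'Hello Привет 你好 123';

// letters of any language, digits are skipped
let letters = str.match(/\p{L}+/gu);
console.log(letters); // ["Hello", "Привет", "你好"]

// the long name gives the same result
let sameLetters = str.match(/\p{Letter}+/gu);
console.log(sameLetters.join() === letters.join()); // true

// without the "u" flag \p is not a property search, so a letter isn't matched
console.log(/\p{L}/.test('a')); // false
```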
Here's the main tree of properties:
- Letter `L`:
  - lowercase `Ll`, modifier `Lm`, titlecase `Lt`, uppercase `Lu`, other `Lo`
- Number `N`:
  - decimal digit `Nd`, letter number `Nl`, other `No`
- Punctuation `P`:
  - connector `Pc`, dash `Pd`, initial quote `Pi`, final quote `Pf`, open `Ps`, close `Pe`, other `Po`
- Mark `M` (accents etc):
  - spacing combining `Mc`, enclosing `Me`, non-spacing `Mn`
- Symbol `S`:
  - currency `Sc`, modifier `Sk`, math `Sm`, other `So`
- Separator `Z`:
  - line `Zl`, paragraph `Zp`, space `Zs`
- Other `C`:
  - control `Cc`, format `Cf`, not assigned `Cn`, private use `Co`, surrogate `Cs`
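To get a feeling for the subcategories, here's a sketch that picks a few of them out of a made-up sample string:

```js
let str = 'Total: $10, €20 and ¥30';

// currency symbols
let currencies = str.match(/\p{Sc}/gu);
console.log(currencies); // ["$", "€", "¥"]

// uppercase letters
let uppercase = str.match(/\p{Lu}/gu);
console.log(uppercase); // ["T"]

// sequences of decimal digits
let numbers = str.match(/\p{Nd}+/gu);
console.log(numbers); // ["10", "20", "30"]
```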
```smart header="More information"
Interested to see which characters belong to a property? There's a tool at <http://cldr.unicode.org/unicode-utilities/list-unicodeset> for that.
You could also explore properties at [Character Property Index](http://unicode.org/cldr/utility/properties.jsp).
For the full Unicode Character Database in text format (along with all properties), see <https://www.unicode.org/Public/UCD/latest/ucd/>.
```
There are also other derived categories, like `Alphabetic` (`Alpha`), which includes Letters `L`, plus letter numbers `Nl`, plus some other symbols `Other_Alphabetic` (`OAlpha`).
Unicode is a big beast, it includes a lot of properties.
One of the properties is `Script` (`sc`), a collection of letters and other written signs used to represent textual information in one or more writing systems. There are about 150 scripts, including Cyrillic, Greek, Arabic, Han (Chinese) etc, the [list is long](https://en.wikipedia.org/wiki/Script_(Unicode)).
The `Script` property needs a value, e.g. to search for Cyrillic letters: `\p{sc=Cyrillic}`.
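For instance, a sketch that splits a made-up mixed string by script:

```js
let str = 'Hello Привет αβγ 你好';

let cyrillic = str.match(/\p{sc=Cyrillic}+/gu);
console.log(cyrillic); // ["Привет"]

let greek = str.match(/\p{sc=Greek}+/gu);
console.log(greek); // ["αβγ"]

let han = str.match(/\p{sc=Han}+/gu);
console.log(han); // ["你好"]

// the long property name works as well
console.log(/\p{Script=Cyrillic}/u.test('Я')); // true
```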
Let's make a "universal" regexp for `pattern:\w`, for any language. That task has a standard solution in many programming languages with unicode-aware regexps, e.g. Perl.
```
/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u
```
Let's decipher. Remember, `pattern:\w` is actually the same as `pattern:[a-zA-Z0-9_]`.
So the character set includes:
- `Alphabetic` for letters,
- `Mark` for accents, as in Unicode accents may be represented by separate code points,
- `Decimal_Number` for numbers,
- `Connector_Punctuation` for the `'_'` character and the like,
- `Join_Control` for two special code points with hex codes `200c` and `200d`, used in ligatures, e.g. in Arabic.
Or, if we replace long names with aliases (a list of aliases [here](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)):
```js run
let regexp = /([\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+)/gu;
let str = `Hello Привет 你好 123_456`;
alert( str.match(regexp) ); // Hello,Привет,你好,123_456
```
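For comparison, here's a sketch of what the plain `pattern:\w` finds in the same string. It only knows Latin letters, digits and the underscore, so the Russian and Chinese words are lost:

```js
let str = `Hello Привет 你好 123_456`;

let plainWords = str.match(/\w+/g);
console.log(plainWords); // ["Hello", "123_456"]
```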

View file

@ -0,0 +1,71 @@
# "Sticky" flag `y`, searching at position [#y-flag]
To grasp the purpose of the `y` flag, and see how great it is, let's explore a practical use case.
One of the common tasks for regexps is "parsing": when we get a text and analyze it for logical components, build a structure.
For instance, there are HTML parsers for browser pages, that turn text into a structured document. There are parsers for programming languages, like JavaScript, etc.
Writing parsers is a special area, with its own tools and algorithms, so we won't go deep into it, but there's a very common question: "What is the text at the given position?".
For instance, for a programming language the variants can be:
- Is it a "name" `pattern:\w+`?
- Or is it a number `pattern:\d+`?
- Or an operator `pattern:[-+*/]`?
- (a syntax error if it's not anything in the expected list)
In JavaScript, to perform a search starting from a given position, we can use `regexp.exec` with the `regexp.lastIndex` property, but that's not what we need!
We'd like to check for a match exactly at the given position, not "starting" from it.
Here's a (failing) attempt to use `lastIndex`:
```js run
let str = "(text before) function ...";
// attempting to find function at position 5:
let regexp = /function/g; // must use "g" flag, otherwise lastIndex is ignored
regexp.lastIndex = 5;
alert (regexp.exec(str)); // function
```
The match is found, because `regexp.exec` starts the search from the given position and goes on through the text, successfully matching "function" later.
We could work around that by checking whether the `regexp.exec(str).index` property is `5`, and if not, ignoring the match. But the main problem here is performance.
The regexp engine does a lot of unnecessary work by scanning further positions. The delays are clearly noticeable if the text is long, because there are many such searches in a parser.
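The index-checking workaround could be sketched like this. The helper `matchAt` is hypothetical, made up for illustration. It gives the right answer, but the engine still scans the rest of the text before we throw the result away:

```js
// Hypothetical helper: returns the match only if it starts exactly at `position`
function matchAt(regexp, str, position) {
  regexp.lastIndex = position; // the regexp must have the "g" flag
  let result = regexp.exec(str); // may find a match anywhere after position
  return result && result.index === position ? result : null;
}

let str = "(text before) function ...";
let regexp = /function/g;

console.log(matchAt(regexp, str, 5)); // null (there's a match, but at position 14)
console.log(matchAt(regexp, str, 14)[0]); // function
```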
## The "y" flag
So we've come to the problem: how to search for a match exactly at the given position.
That's what the `y` flag does. It makes the regexp search only at the `lastIndex` position.
Here's an example:
```js run
let str = "(text before) function ...";
*!*
let regexp = /function/y;
regexp.lastIndex = 5;
*/!*
alert (regexp.exec(str)); // null (no match, unlike "g" flag!)
*!*
regexp.lastIndex = 14;
*/!*
alert (regexp.exec(str)); // function (match!)
```
As we can see, now the regexp is only matched at the given position.
So what `y` does is truly unique, and very important for writing parsers.
The `y` flag lets us apply a regular expression (or many of them, one by one) exactly at the given position, and once we understand what's there, we can move on, examining the text step by step.
Without the flag the regexp engine always searches till the end of the text, which takes time, especially if the text is large. So our parser would be very slow. The `y` flag is exactly the right thing here.
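To put the pieces together, here's a minimal sketch of a tokenizer built on the `y` flag. It's only an illustration, not a real parser: the `rules` table, the token type names and the rule for skipping spaces are assumptions made for this example. The number rule goes before the name rule, because `pattern:\w+` would match digits too.

```js
// Token rules, tried one by one at the current position (names are made up for the sketch)
const rules = [
  ['space', /\s+/y],
  ['number', /\d+/y],
  ['name', /\w+/y],
  ['operator', /[-+*\/]/y],
];

function tokenize(text) {
  let tokens = [];
  let position = 0;

  while (position < text.length) {
    let found = false;

    for (let [type, regexp] of rules) {
      regexp.lastIndex = position; // try this rule exactly at the current position
      let match = regexp.exec(text);
      if (!match) continue;

      if (type !== 'space') tokens.push({ type, value: match[0] });
      position = regexp.lastIndex; // after a match, lastIndex points to its end
      found = true;
      break;
    }

    // nothing from the expected list: a syntax error
    if (!found) throw new SyntaxError(`Unexpected character at position ${position}`);
  }

  return tokens;
}

console.log(tokenize('width * 2 + 10').map(token => token.value)); // ["width", "*", "2", "+", "10"]
```

Every rule is checked only at `position`, so no text is scanned twice, no matter how long the input is.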