This commit is contained in:
Ilya Kantor 2019-09-04 15:44:48 +03:00
parent ef370b6ace
commit f21cb0a2f4
71 changed files with 707 additions and 727 deletions


@ -2,7 +2,7 @@
Regular expressions are a powerful way to search and replace in text.
In JavaScript, they are available as `RegExp` object, and also integrated in methods of strings.
In JavaScript, they are available as [RegExp](mdn:js/RegExp) object, and also integrated in methods of strings.
## Regular Expressions
@ -23,35 +23,43 @@ regexp = /pattern/; // no flags
regexp = /pattern/gmi; // with flags g,m and i (to be covered soon)
```
Slashes `"/"` tell JavaScript that we are creating a regular expression. They play the same role as quotes for strings.
Slashes `pattern:/.../` tell JavaScript that we are creating a regular expression. They play the same role as quotes for strings.
## Usage
In both cases `regexp` becomes an object of the built-in `RegExp` class.
To search inside a string, we can use method [search](mdn:js/String/search).
The main difference between these two syntaxes is that slashes `pattern:/.../` do not allow inserting expressions (like strings with `${...}`). They are fully static.
Here's an example:
Slashes are used when we know the regular expression at the code writing time -- and that's the most common situation. While `new RegExp` is used when we need to create a regexp "on the fly", from a dynamically generated string, for instance:
```js run
let str = "I love JavaScript!"; // will search here
```js
let tag = prompt("What tag do you want to find?", "h2");
let regexp = /love/;
alert( str.search(regexp) ); // 2
let regexp = new RegExp(`<${tag}>`); // same as /<h2>/ if answered "h2" in the prompt above
```
The `str.search` method looks for the pattern `pattern:/love/` and returns the position inside the string. As we might guess, `pattern:/love/` is the simplest possible pattern. What it does is a simple substring search.
## Flags
The code above is the same as:
Regular expressions may have flags that affect the search.
```js run
let str = "I love JavaScript!"; // will search here
There are only 6 of them in JavaScript:
let substr = 'love';
alert( str.search(substr) ); // 2
```
`pattern:i`
: With this flag the search is case-insensitive: no difference between `A` and `a` (see the example below).
So searching for `pattern:/love/` is the same as searching for `"love"`.
`pattern:g`
: With this flag the search looks for all matches, without it -- only the first one.
But that's only for now. Soon we'll create more complex regular expressions with much more searching power.
`pattern:m`
: Multiline mode (covered in the chapter <info:regexp-multiline-mode>).
`pattern:s`
: Enables "dotall" mode, that allows a dot `pattern:.` to match newline character `\n` (covered in the chapter <info:regexp-character-classes>).
`pattern:u`
: Enables full unicode support. The flag enables correct processing of surrogate pairs. More about that in the chapter <info:regexp-unicode>.
`pattern:y`
: "Sticky" mode: searching at the exact position in the text (covered in the chapter <info:regexp-sticky>)
```smart header="Colors"
From here on the color scheme is:
@ -61,65 +69,109 @@ From here on the color scheme is:
- result -- `match:green`
```
## Searching: str.match
````smart header="When to use `new RegExp`?"
Normally we use the short syntax `/.../`. But it does not support variable insertions `${...}`.
As it was said previously, regular expressions are integrated with string methods.
On the other hand, `new RegExp` allows to construct a pattern dynamically from a string, so it's more flexible.
The method `str.match(regexp)` finds all matches of `regexp` in the string `str`.
Here's an example of a dynamically generated regexp:
It has 3 working modes:
1. If the regular expression has flag `pattern:g`, it returns an array of all matches:
```js run
let str = "We will, we will rock you";
alert( str.match(/we/gi) ); // We,we (an array of 2 matches)
```
Please note that both `match:We` and `match:we` are found, because flag `pattern:i` makes the regular expression case-insensitive.
2. If there's no such flag it returns only the first match in the form of an array, with the full match at index `0` and some additional details in properties:
```js run
let str = "We will, we will rock you";
let result = str.match(/we/i); // without flag g
alert( result[0] ); // We (1st match)
alert( result.length ); // 1
// Details:
alert( result.index ); // 0 (position of the match)
alert( result.input ); // We will, we will rock you (source string)
```
The array may have other indexes, besides `0` if a part of the regular expression is enclosed in parentheses. We'll cover that in the chapter <info:regexp-groups>.
3. And, finally, if there are no matches, `null` is returned (doesn't matter if there's flag `pattern:g` or not).
That's a very important nuance. If there are no matches, we get not an empty array, but `null`. Forgetting about that may lead to errors, e.g.:
```js run
let matches = "JavaScript".match(/HTML/); // = null
if (!matches.length) { // Error: Cannot read property 'length' of null
alert("Error in the line above");
}
```
If we'd like the result to be always an array, we can write it this way:
```js run
let matches = "JavaScript".match(/HTML/)*!* || []*/!*;
if (!matches.length) {
alert("No matches"); // now it works
}
```
## Replacing: str.replace
The method `str.replace(regexp, replacement)` replaces matches with `regexp` in string `str` with `replacement` (all matches, if there's flag `pattern:g`, otherwise only the first one).
For instance:
```js run
let tag = prompt("Which tag you want to search?", "h2");
let regexp = new RegExp(`<${tag}>`);
// no flag g
alert( "We will, we will".replace(/we/i, "I") ); // I will, we will
// finds <h2> by default
alert( "<h1> <h2> <h3>".search(regexp));
// with flag g
alert( "We will, we will".replace(/we/ig, "I") ); // I will, I will
```
````
The second argument is the `replacement` string. We can use special character combinations in it to insert fragments of the match:
## Flags
| Symbols | Action in the replacement string |
|--------|--------|
|`$&`|inserts the whole match|
|<code>$&#096;</code>|inserts a part of the string before the match|
|`$'`|inserts a part of the string after the match|
|`$n`|if `n` is a 1-2 digit number, then it inserts the contents of n-th parentheses, more about it in the chapter <info:regexp-groups>|
|`$<name>`|inserts the contents of the parentheses with the given `name`, more about it in the chapter <info:regexp-groups>|
|`$$`|inserts character `$` |
Regular expressions may have flags that affect the search.
There are only 6 of them in JavaScript:
`i`
: With this flag the search is case-insensitive: no difference between `A` and `a` (see the example below).
`g`
: With this flag the search looks for all matches, without it -- only the first one (we'll see uses in the next chapter).
`m`
: Multiline mode (covered in the chapter <info:regexp-multiline-mode>).
`s`
: "Dotall" mode, allows `.` to match newlines (covered in the chapter <info:regexp-character-classes>).
`u`
: Enables full unicode support. The flag enables correct processing of surrogate pairs. More about that in the chapter <info:regexp-unicode>.
`y`
: Sticky mode (covered in the chapter <info:regexp-sticky>)
We'll cover all these flags further in the tutorial.
For now, the simplest flag is `i`, here's an example:
An example with `pattern:$&`:
```js run
let str = "I love JavaScript!";
alert( str.search(/LOVE/i) ); // 2 (found lowercased)
alert( str.search(/LOVE/) ); // -1 (nothing found without 'i' flag)
alert( "I love HTML".replace(/HTML/, "$& and JavaScript") ); // I love HTML and JavaScript
```
So the `i` flag already makes regular expressions more powerful than a simple substring search. But there's so much more. We'll cover other flags and features in the next chapters.
## Testing: regexp.test
The method `regexp.test(str)` looks for at least one match, if found, returns `true`, otherwise `false`.
```js run
let str = "I love JavaScript";
let reg = /LOVE/i;
alert( reg.test(str) ); // true
```
Further in this chapter we'll study more regular expressions, come across many other examples and also meet other methods.
Full information about the methods is given in the article <info:regexp-methods>.
## Summary
- A regular expression consists of a pattern and optional flags: `g`, `i`, `m`, `u`, `s`, `y`.
- Without flags and special symbols that we'll study later, the search by a regexp is the same as a substring search.
- The method `str.search(regexp)` returns the index where the match is found or `-1` if there's no match. In the next chapter we'll see other methods.
- A regular expression consists of a pattern and optional flags: `pattern:g`, `pattern:i`, `pattern:m`, `pattern:u`, `pattern:s`, `pattern:y`.
- Without flags and special symbols that we'll study later, the search by a regexp is the same as a substring search.
- The method `str.match(regexp)` looks for matches: all of them if there's `pattern:g` flag, otherwise only the first one.
- The method `str.replace(regexp, replacement)` replaces matches with `regexp` by `replacement`: all of them if there's `pattern:g` flag, otherwise only the first one.
- The method `regexp.test(str)` returns `true` if there's at least one match, otherwise `false`.
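
As a quick recap, the three methods from the summary can be tried side by side. This is only a sketch: the variable names are made up for illustration, and `console.log` is used instead of `alert` so that it also runs outside the browser.

```js
let str = "We will, we will rock you";

// str.match with flag g: an array of all matches
let all = str.match(/we/gi);
console.log(all); // [ 'We', 'we' ]

// str.match returns null when nothing is found, so we fall back to an empty array
let none = str.match(/they/gi) || [];
console.log(none.length); // 0

// str.replace with flag g replaces every match
let replaced = str.replace(/we/gi, "I");
console.log(replaced); // I will, I will rock you

// regexp.test only reports whether there is at least one match
let found = /ROCK/i.test(str);
console.log(found); // true
```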


@ -0,0 +1,189 @@
# Character classes
Consider a practical task -- we have a phone number like `"+7(903)-123-45-67"`, and we need to turn it into pure numbers: `79031234567`.
To do so, we can find and remove anything that's not a number. Character classes can help with that.
A *character class* is a special notation that matches any symbol from a certain set.
For the start, let's explore the "digit" class. It's written as `pattern:\d` and corresponds to "any single digit".
For instance, let's find the first digit in the phone number:
```js run
let str = "+7(903)-123-45-67";
let reg = /\d/;
alert( str.match(reg) ); // 7
```
Without the flag `pattern:g`, the regular expression only looks for the first match, that is the first digit `pattern:\d`.
Let's add the `pattern:g` flag to find all digits:
```js run
let str = "+7(903)-123-45-67";
let reg = /\d/g;
alert( str.match(reg) ); // array of matches: 7,9,0,3,1,2,3,4,5,6,7
// let's make the digits-only phone number of them:
alert( str.match(reg).join('') ); // 79031234567
```
That was a character class for digits. There are other character classes as well.
Most used are:
`pattern:\d` ("d" is from "digit")
: A digit: a character from `0` to `9`.
`pattern:\s` ("s" is from "space")
: A space symbol: includes spaces, tabs `\t`, newlines `\n` and a few other rare characters: `\v`, `\f` and `\r`.
`pattern:\w` ("w" is from "word")
: A "wordly" character: either a letter of Latin alphabet or a digit or an underscore `_`. Non-Latin letters (like cyrillic or hindi) do not belong to `pattern:\w`.
For instance, `pattern:\d\s\w` means a "digit" followed by a "space character" followed by a "wordly character", such as `match:1 a`.
**A regexp may contain both regular symbols and character classes.**
For instance, `pattern:CSS\d` matches a string `match:CSS` with a digit after it:
```js run
let str = "Is there CSS4?";
let reg = /CSS\d/
alert( str.match(reg) ); // CSS4
```
Also we can use many character classes:
```js run
alert( "I love HTML5!".match(/\s\w\w\w\w\d/) ); // ' HTML5'
```
The match (each regexp character class has the corresponding result character):
![](love-html5-classes.svg)
## Inverse classes
For every character class there exists an "inverse class", denoted with the same letter, but uppercased.
The "inverse" means that it matches all other characters, for instance:
`pattern:\D`
: Non-digit: any character except `pattern:\d`, for instance a letter.
`pattern:\S`
: Non-space: any character except `pattern:\s`, for instance a letter.
`pattern:\W`
: Non-wordly character: anything but `pattern:\w`, e.g. a non-Latin letter or a space.
In the beginning of the chapter we saw how to make a number-only phone number from a string like `subject:+7(903)-123-45-67`: find all digits and join them.
```js run
let str = "+7(903)-123-45-67";
alert( str.match(/\d/g).join('') ); // 79031234567
```
An alternative, shorter way is to find non-digits `pattern:\D` and remove them from the string:
```js run
let str = "+7(903)-123-45-67";
alert( str.replace(/\D/g, "") ); // 79031234567
```
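
The other inverse classes work the same way. Here's a small sketch on a made-up string (the string and variable names are only for illustration) that uses `pattern:\W`, `pattern:\s` and `pattern:\S`:

```js
let str = "user_1 @ example!";

// \W: everything that is not a Latin letter, digit or underscore
let nonWordly = str.match(/\W/g);
console.log(nonWordly); // [ ' ', '@', ' ', '!' ]

// \s with replace: remove all space characters
let noSpaces = str.replace(/\s/g, "");
console.log(noSpaces); // user_1@example!

// \S: count the characters that are not spaces
let visibleCount = str.match(/\S/g).length;
console.log(visibleCount); // 15
```

Please note that the underscore `_` is not in the `pattern:\W` result, because it belongs to `pattern:\w`.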
## A dot is any character
A dot `pattern:.` is a special character class that matches "any character except a newline".
For instance:
```js run
alert( "Z".match(/./) ); // Z
```
Or in the middle of a regexp:
```js run
let reg = /CS.4/;
alert( "CSS4".match(reg) ); // CSS4
alert( "CS-4".match(reg) ); // CS-4
alert( "CS 4".match(reg) ); // CS 4 (space is also a character)
```
Please note that a dot means "any character", but not the "absence of a character". There must be a character to match it:
```js run
alert( "CS4".match(/CS.4/) ); // null, no match because there's no character for the dot
```
### Dot as literally any character with "s" flag
Usually a dot doesn't match a newline character `\n`.
For instance, the regexp `pattern:A.B` matches `match:A`, and then `match:B` with any character between them, except a newline `\n`:
```js run
alert( "A\nB".match(/A.B/) ); // null (no match)
```
There are many situations when we'd like a dot to mean literally "any character", newline included.
That's what flag `pattern:s` does. If a regexp has it, then a dot `pattern:.` matches literally any character:
```js run
alert( "A\nB".match(/A.B/s) ); // A\nB (match!)
```
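
If the flag `pattern:s` is not available in the target environment, a commonly used workaround is `pattern:[\s\S]`, which means "a space character or a non-space character", that is: anything. Square brackets are explained later in the tutorial, so take this as a sketch rather than a full explanation:

```js
let text = "A\nB";

let plainDot = /A.B/.test(text);        // a dot does not match \n
let dotAll = /A.B/s.test(text);         // with flag s it does
let workaround = /A[\s\S]B/.test(text); // works without flag s

console.log(plainDot);   // false
console.log(dotAll);     // true
console.log(workaround); // true
```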
````warn header="Pay attention to spaces"
Usually we pay little attention to spaces. For us strings `subject:1-5` and `subject:1 - 5` are nearly identical.
But if a regexp doesn't take spaces into account, it may fail to work.
Let's try to find digits separated by a hyphen:
```js run
alert( "1 - 5".match(/\d-\d/) ); // null, no match!
```
Let's fix it adding spaces into the regexp `pattern:\d - \d`:
```js run
alert( "1 - 5".match(/\d - \d/) ); // 1 - 5, now it works
// or we can use \s class:
alert( "1 - 5".match(/\d\s-\s\d/) ); // 1 - 5, also works
```
**A space is a character. Equal in importance with any other character.**
We can't add or remove spaces from a regular expression and expect it to work the same.
In other words, in a regular expression all characters matter, spaces too.
````
## Summary
There exist the following character classes:
- `pattern:\d` -- digits.
- `pattern:\D` -- non-digits.
- `pattern:\s` -- space symbols, tabs, newlines.
- `pattern:\S` -- all but `pattern:\s`.
- `pattern:\w` -- Latin letters, digits, underscore `'_'`.
- `pattern:\W` -- all but `pattern:\w`.
- `pattern:.` -- any character if with the regexp `'s'` flag, otherwise any except a newline `\n`.
...But that's not all!
Unicode encoding, used by JavaScript for strings, provides many properties for characters, like: which language the letter belongs to (if it's a letter), whether it is a punctuation sign, etc.
We can search by these properties as well. That requires flag `pattern:u`, covered in the next article.


@ -1,6 +0,0 @@
The answer: `pattern:\b\d\d:\d\d\b`.
```js run
alert( "Breakfast at 09:00 in the room 123:456.".match( /\b\d\d:\d\d\b/ ) ); // 09:00
```


@ -1,8 +0,0 @@
# Find the time
The time has a format: `hours:minutes`. Both hours and minutes have two digits, like `09:00`.
Make a regexp to find time in the string: `subject:Breakfast at 09:00 in the room 123:456.`
P.S. In this task there's no need to check time correctness yet, so `25:99` can also be a valid result.
P.P.S. The regexp shouldn't match `123:456`.


@ -1,270 +0,0 @@
# Character classes
Consider a practical task -- we have a phone number `"+7(903)-123-45-67"`, and we need to turn it into pure numbers: `79031234567`.
To do so, we can find and remove anything that's not a number. Character classes can help with that.
A character class is a special notation that matches any symbol from a certain set.
For the start, let's explore a "digit" class. It's written as `\d`. We put it in the pattern, that means "any single digit".
For instance, let's find the first digit in the phone number:
```js run
let str = "+7(903)-123-45-67";
let reg = /\d/;
alert( str.match(reg) ); // 7
```
Without the flag `g`, the regular expression only looks for the first match, that is the first digit `\d`.
Let's add the `g` flag to find all digits:
```js run
let str = "+7(903)-123-45-67";
let reg = /\d/g;
alert( str.match(reg) ); // array of matches: 7,9,0,3,1,2,3,4,5,6,7
alert( str.match(reg).join('') ); // 79031234567
```
That was a character class for digits. There are other character classes as well.
Most used are:
`\d` ("d" is from "digit")
: A digit: a character from `0` to `9`.
`\s` ("s" is from "space")
: A space symbol: that includes spaces, tabs, newlines.
`\w` ("w" is from "word")
: A "wordly" character: either a letter of English alphabet or a digit or an underscore. Non-Latin letters (like cyrillic or hindi) do not belong to `\w`.
For instance, `pattern:\d\s\w` means a "digit" followed by a "space character" followed by a "wordly character", like `"1 a"`.
**A regexp may contain both regular symbols and character classes.**
For instance, `pattern:CSS\d` matches a string `match:CSS` with a digit after it:
```js run
let str = "CSS4 is cool";
let reg = /CSS\d/
alert( str.match(reg) ); // CSS4
```
Also we can use many character classes:
```js run
alert( "I love HTML5!".match(/\s\w\w\w\w\d/) ); // ' HTML5'
```
The match (each character class corresponds to one result character):
![](love-html5-classes.svg)
## Word boundary: \b
A word boundary `pattern:\b` -- is a special character class.
It does not denote a character, but rather a boundary between characters.
For instance, `pattern:\bJava\b` matches `match:Java` in the string `subject:Hello, Java!`, but not in the script `subject:Hello, JavaScript!`.
```js run
alert( "Hello, Java!".match(/\bJava\b/) ); // Java
alert( "Hello, JavaScript!".match(/\bJava\b/) ); // null
```
The boundary has "zero width" in a sense that usually a character class means a character in the result (like a wordly character or a digit), but not in this case.
The boundary is a test.
When regular expression engine is doing the search, it's moving along the string in an attempt to find the match. At each string position it tries to find the pattern.
When the pattern contains `pattern:\b`, it tests that the position in string is a word boundary, that is one of three variants:
There are three different positions that qualify as word boundaries:
- At string start, if the first string character is a word character `\w`.
- Between two characters in the string, where one is a word character `\w` and the other is not.
- At string end, if the last string character is a word character `\w`.
For instance, in the string `subject:Hello, Java!` the following positions match `\b`:
![](hello-java-boundaries.svg)
So it matches `pattern:\bHello\b`, because:
1. At the beginning of the string the first `\b` test matches.
2. Then the word `Hello` matches.
3. Then `\b` matches, as we're between `o` (a word character) and a space (not a word character).
Pattern `pattern:\bJava\b` also matches. But not `pattern:\bHell\b` (because there's no word boundary after `l`) and not `Java!\b` (because the exclamation sign is not a wordly character, so there's no word boundary after it).
```js run
alert( "Hello, Java!".match(/\bHello\b/) ); // Hello
alert( "Hello, Java!".match(/\bJava\b/) ); // Java
alert( "Hello, Java!".match(/\bHell\b/) ); // null (no match)
alert( "Hello, Java!".match(/\bJava!\b/) ); // null (no match)
```
Once again let's note that `pattern:\b` makes the searching engine to test for the boundary, so that `pattern:Java\b` finds `match:Java` only when followed by a word boundary, but it does not add a letter to the result.
Usually we use `\b` to find standalone English words. So that if we want `"Java"` language then `pattern:\bJava\b` finds exactly a standalone word and ignores it when it's a part of another word, e.g. it won't match `match:Java` in `subject:JavaScript`.
Another example: a regexp `pattern:\b\d\d\b` looks for standalone two-digit numbers. In other words, it requires that before and after `pattern:\d\d` must be a symbol different from `\w` (or beginning/end of the string).
```js run
alert( "1 23 456 78".match(/\b\d\d\b/g) ); // 23,78
```
```warn header="Word boundary doesn't work for non-Latin alphabets"
The word boundary check `\b` tests for a boundary between `\w` and something else. But `\w` means an English letter (or a digit or an underscore), so the test won't work for other characters (like cyrillic or hieroglyphs).
Later we'll come by Unicode character classes that allow to solve the similar task for different languages.
```
## Inverse classes
For every character class there exists an "inverse class", denoted with the same letter, but uppercased.
The "reverse" means that it matches all other characters, for instance:
`\D`
: Non-digit: any character except `\d`, for instance a letter.
`\S`
: Non-space: any character except `\s`, for instance a letter.
`\W`
: Non-wordly character: anything but `\w`.
`\B`
: Non-boundary: a test reverse to `\b`.
In the beginning of the chapter we saw how to get all digits from the phone `subject:+7(903)-123-45-67`.
One way was to match all digits and join them:
```js run
let str = "+7(903)-123-45-67";
alert( str.match(/\d/g).join('') ); // 79031234567
```
An alternative, shorter way is to find non-digits `\D` and remove them from the string:
```js run
let str = "+7(903)-123-45-67";
alert( str.replace(/\D/g, "") ); // 79031234567
```
## Spaces are regular characters
Usually we pay little attention to spaces. For us strings `subject:1-5` and `subject:1 - 5` are nearly identical.
But if a regexp doesn't take spaces into account, it may fail to work.
Let's try to find digits separated by a dash:
```js run
alert( "1 - 5".match(/\d-\d/) ); // null, no match!
```
Here we fix it by adding spaces into the regexp `pattern:\d - \d`:
```js run
alert( "1 - 5".match(/\d - \d/) ); // 1 - 5, now it works
```
**A space is a character. Equal in importance with any other character.**
Of course, spaces in a regexp are needed only if we look for them. Extra spaces (just like any other extra characters) may prevent a match:
```js run
alert( "1-5".match(/\d - \d/) ); // null, because the string 1-5 has no spaces
```
In other words, in a regular expression all characters matter, spaces too.
## A dot is any character
The dot `"."` is a special character class that matches "any character except a newline".
For instance:
```js run
alert( "Z".match(/./) ); // Z
```
Or in the middle of a regexp:
```js run
let reg = /CS.4/;
alert( "CSS4".match(reg) ); // CSS4
alert( "CS-4".match(reg) ); // CS-4
alert( "CS 4".match(reg) ); // CS 4 (space is also a character)
```
Please note that the dot means "any character", but not the "absence of a character". There must be a character to match it:
```js run
alert( "CS4".match(/CS.4/) ); // null, no match because there's no character for the dot
```
### The dotall "s" flag
Usually a dot doesn't match a newline character.
For instance, `pattern:A.B` matches `match:A`, and then `match:B` with any character between them, except a newline.
This doesn't match:
```js run
alert( "A\nB".match(/A.B/) ); // null (no match)
// a space character would match, or a letter, but not \n
```
Sometimes it's inconvenient, we really want "any character", newline included.
That's what `s` flag does. If a regexp has it, then the dot `"."` matches literally any character:
```js run
alert( "A\nB".match(/A.B/s) ); // A\nB (match!)
```
## Summary
There exist following character classes:
- `pattern:\d` -- digits.
- `pattern:\D` -- non-digits.
- `pattern:\s` -- space symbols, tabs, newlines.
- `pattern:\S` -- all but `pattern:\s`.
- `pattern:\w` -- English letters, digits, underscore `'_'`.
- `pattern:\W` -- all but `pattern:\w`.
- `pattern:.` -- any character if with the regexp `'s'` flag, otherwise any except a newline.
...But that's not all!
The Unicode encoding, used by JavaScript for strings, provides many properties for characters, like: which language the letter belongs to (if it's a letter), whether it is a punctuation sign, etc.
Modern JavaScript allows to use these properties in regexps to look for characters, for instance:
- A cyrillic letter is: `pattern:\p{Script=Cyrillic}` or `pattern:\p{sc=Cyrillic}`.
- A dash (be it a small hyphen `-` or a long dash `—`): `pattern:\p{Dash_Punctuation}` or `pattern:\p{pd}`.
- A currency symbol, such as `$`, `€` or another: `pattern:\p{Currency_Symbol}` or `pattern:\p{Sc}`.
- ...And much more. Unicode has a lot of character categories that we can select from.
These patterns require `'u'` regexp flag to work. More about that in the chapter [](info:regexp-unicode).


@ -0,0 +1,167 @@
# Unicode: flag "u" and class \p{...}
JavaScript uses [Unicode encoding](https://en.wikipedia.org/wiki/Unicode) for strings. Most characters are encoded with 2 bytes, but that allows representing at most 65536 characters.
That range is not big enough to encode all possible characters, that's why some rare characters are encoded with 4 bytes, for instance like `𝒳` (mathematical X) or `😄` (a smile), some hieroglyphs and so on.
Here are the unicode values of some characters:
| Character | Unicode | Bytes count in unicode |
|------------|---------|--------|
| a | `0x0061` | 2 |
| ≈ | `0x2248` | 2 |
|𝒳| `0x1d4b3` | 4 |
|𝒴| `0x1d4b4` | 4 |
|😄| `0x1f604` | 4 |
So characters like `a` and `≈` occupy 2 bytes, while codes for `𝒳`, `𝒴` and `😄` are longer, they have 4 bytes.
A long time ago, when the JavaScript language was created, Unicode encoding was simpler: there were no 4-byte characters. So, some language features still handle them incorrectly.
For instance, `length` thinks that here are two characters:
```js run
alert('😄'.length); // 2
alert('𝒳'.length); // 2
```
...But we can see that there's only one, right? The point is that `length` treats 4 bytes as two 2-byte characters. That's incorrect, because they must be considered only together (so-called "surrogate pair", you can read about them in the article <info:string>).
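
To see the two halves of a surrogate pair, we can compare `length` with methods that work on whole characters. A small sketch (the variable names are only for illustration):

```js
let smile = "😄";

// length counts 2-byte units
let units = smile.length;
console.log(units); // 2

// string iteration (used by the spread syntax) goes by whole characters
let chars = [...smile].length;
console.log(chars); // 1

// codePointAt reads the full code, charCodeAt reads one 2-byte half
let code = smile.codePointAt(0).toString(16);
let firstHalf = smile.charCodeAt(0).toString(16);
let secondHalf = smile.charCodeAt(1).toString(16);
console.log(code);       // 1f604
console.log(firstHalf);  // d83d
console.log(secondHalf); // de04
```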
By default, regular expressions also treat 4-byte "long characters" as a pair of 2-byte ones. And, as it happens with strings, that may lead to odd results. We'll see that a bit later, in the article <info:regexp-character-sets-and-ranges>.
Unlike strings, regular expressions have flag `pattern:u` that fixes such problems. With such flag, a regexp handles 4-byte characters correctly. And also Unicode property search becomes available, we'll get to it next.
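
Here's a minimal sketch of the difference, using a dot `pattern:.` to count "characters" in a one-smile string. The variable names are made up for illustration:

```js
let smile = "😄";

// without flag u the dot matches each 2-byte half separately
let withoutU = smile.match(/./g).length;
console.log(withoutU); // 2

// with flag u the dot matches the whole 4-byte character
let withU = smile.match(/./gu).length;
console.log(withU); // 1
```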
## Unicode properties \p{...}
```warn header="Not supported in Firefox and Edge"
Despite being a part of the standard since 2018, unicode properties are not supported in Firefox ([bug](https://bugzilla.mozilla.org/show_bug.cgi?id=1361876)) and Edge ([bug](https://github.com/Microsoft/ChakraCore/issues/2969)).
There's [XRegExp](http://xregexp.com) library that provides "extended" regular expressions with cross-browser support for unicode properties.
```
Every character in Unicode has a lot of properties. They describe what "category" the character belongs to, contain miscellaneous information about it.
For instance, if a character has `Letter` property, it means that the character belongs to an alphabet (of any language). And `Number` property means that it's a digit: maybe Arabic or Chinese, and so on.
We can search for characters with a property, written as `pattern:\p{…}`. To use `pattern:\p{…}`, a regular expression must have flag `pattern:u`.
For instance, `\p{Letter}` denotes a letter in any language. We can also use `\p{L}`, as `L` is an alias of `Letter`. There are shorter aliases for almost every property.
In the example below three kinds of letters will be found: English, Georgian and Korean.
```js run
let str = "A ბ ㄱ";
alert( str.match(/\p{L}/gu) ); // A,ბ,ㄱ
alert( str.match(/\p{L}/g) ); // null (no matches, as there's no flag "u")
```
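
For comparison, here's a sketch that counts characters in a mixed-language string with `pattern:\w` and with `pattern:\p{L}`. The sample string is made up; it assumes an engine that supports Unicode properties:

```js
let str = "Привет, world 123";

// \w knows only Latin letters, digits and the underscore
let wordly = str.match(/\w/g).length;
console.log(wordly); // 8 (w,o,r,l,d,1,2,3)

// \p{L} finds letters of any language, but not digits
let letters = str.match(/\p{L}/gu).length;
console.log(letters); // 11 (6 Cyrillic + 5 Latin)
```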
Here are the main character categories and their subcategories:
- Letter `L`:
- lowercase `Ll`,
- modifier `Lm`,
- titlecase `Lt`,
- uppercase `Lu`,
- other `Lo`.
- Number `N`:
- decimal digit `Nd`,
- letter number `Nl`,
- other `No`.
- Punctuation `P`:
- connector `Pc`,
- dash `Pd`,
- initial quote `Pi`,
- final quote `Pf`,
- open `Ps`,
- close `Pe`,
- other `Po`.
- Mark `M` (accents etc):
- spacing combining `Mc`,
- enclosing `Me`,
- non-spacing `Mn`.
- Symbol `S`:
- currency `Sc`,
- modifier `Sk`,
- math `Sm`,
- other `So`.
- Separator `Z`:
- line `Zl`,
- paragraph `Zp`,
- space `Zs`.
- Other `C`:
- control `Cc`,
- format `Cf`,
- not assigned `Cn`,
- private use `Co`,
- surrogate `Cs`.
So, e.g. if we need letters in lower case, we can write `pattern:\p{Ll}`, punctuation signs: `pattern:\p{P}` and so on.
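
A few of these categories in action, as a sketch on a made-up string (it assumes an engine with Unicode property support):

```js
let str = "Hello, World! 42";

let upper = str.match(/\p{Lu}/gu);  // uppercase letters
let lower = str.match(/\p{Ll}/gu);  // lowercase letters
let punct = str.match(/\p{P}/gu);   // any punctuation
let digits = str.match(/\p{Nd}/gu); // decimal digits

console.log(upper);        // [ 'H', 'W' ]
console.log(lower.length); // 8
console.log(punct);        // [ ',', '!' ]
console.log(digits);       // [ '4', '2' ]
```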
There are also other derived categories, like:
- `Alphabetic` (`Alpha`), includes Letters `L`, plus letter numbers `Nl` (e.g. Ⅻ - a character for the roman number 12), plus some other symbols `Other_Alphabetic` (`OAlpha`).
- `Hex_Digit` includes hexadecimal digits: `0-9`, `a-f`, `A-F`.
- ...And so on.
Unicode supports many different properties, their full list would require a lot of space, so here are the references:
- List all properties by a character: <https://unicode.org/cldr/utility/character.jsp>.
- List all characters by a property: <https://unicode.org/cldr/utility/list-unicodeset.jsp>.
- Short aliases for properties: <https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt>.
- A full base of Unicode characters in text format, with all properties, is here: <https://www.unicode.org/Public/UCD/latest/ucd/>.
### Example: hexadecimal numbers
For instance, let's look for hexadecimal numbers, written as `xFF`, where `F` is a hex digit (0..9 or A..F).
A hex digit can be denoted as `pattern:\p{Hex_Digit}`:
```js run
let reg = /x\p{Hex_Digit}\p{Hex_Digit}/u;
alert("number: xAF".match(reg)); // xAF
```
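
With the flag `pattern:g` the same pattern finds all such numbers. The sketch below also shows one nuance: in the Unicode data, `Hex_Digit` covers not only ASCII characters, but also their "fullwidth" variants (such as `Ａ`, code `0xFF21`). If that's unwanted, there's a stricter property `ASCII_Hex_Digit`. The sample strings are made up for illustration:

```js
let reg = /x\p{Hex_Digit}\p{Hex_Digit}/gu;

let found = "colors: xAF, x0c, xZZ".match(reg);
console.log(found); // [ 'xAF', 'x0c' ]  (xZZ is not a hex number)

// "x" followed by two fullwidth letters: Ａ (0xFF21) and Ｆ (0xFF26)
let fullwidth = "x\uFF21\uFF26";

let loose = /x\p{Hex_Digit}\p{Hex_Digit}/u.test(fullwidth);
let strict = /x\p{ASCII_Hex_Digit}\p{ASCII_Hex_Digit}/u.test(fullwidth);

console.log(loose);  // true
console.log(strict); // false
```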
### Example: Chinese hieroglyphs
Let's look for Chinese hieroglyphs.
There's a unicode property `Script` (a writing system), that may have a value: `Cyrillic`, `Greek`, `Arabic`, `Han` (Chinese) and so on, [here's the full list](https://en.wikipedia.org/wiki/Script_(Unicode)).
To look for characters in a given writing system we should use `pattern:Script=<value>`, e.g. for Cyrillic letters: `pattern:\p{sc=Cyrillic}`, for Chinese hieroglyphs: `pattern:\p{sc=Han}`, and so on:
```js run
let regexp = /\p{sc=Han}/gu; // returns Chinese hieroglyphs
let str = `Hello Привет 你好 123_456`;
alert( str.match(regexp) ); // 你,好
```
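
The same string can be split by writing system just by changing the `Script` value. A sketch, assuming an engine with Unicode property support (variable names are only for illustration):

```js
let str = `Hello Привет 你好 123_456`;

let cyrillic = str.match(/\p{sc=Cyrillic}/gu).join("");
let latin = str.match(/\p{sc=Latin}/gu).join("");
let han = str.match(/\p{sc=Han}/gu).join("");

console.log(cyrillic); // Привет
console.log(latin);    // Hello
console.log(han);      // 你好
```

Digits and the underscore are not found by any of these patterns, as they don't belong to a particular writing system.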
### Example: currency
Characters that denote a currency, such as `$`, `€`, `¥`, have unicode property `pattern:\p{Currency_Symbol}`, the short alias: `pattern:\p{Sc}`.
Let's use it to look for prices in the format "currency, followed by a digit":
```js run
let regexp = /\p{Sc}\d/gu;
let str = `Prices: $2, €1, ¥9`;
alert( str.match(regexp) ); // $2,€1,¥9
```
Later, in the article <info:regexp-quantifiers> we'll see how to look for numbers that contain many digits.
## Summary
Flag `pattern:u` enables the support of Unicode in regular expressions.
That means two things:
1. Characters of 4 bytes are handled correctly: as a single character, not two 2-byte characters.
2. Unicode properties can be used in the search: `\p{…}`.
With Unicode properties we can look for words in given languages, special characters (quotes, currencies) and so on.


@ -1,5 +1,4 @@
The empty string is the only match: it starts and immediately finishes.
An empty string is the only match: it starts and immediately finishes.
The task once again demonstrates that anchors are not characters, but tests.


@ -0,0 +1,52 @@
# Anchors: string start ^ and end $
The caret `pattern:^` and dollar `pattern:$` characters have special meaning in a regexp. They are called "anchors".
The caret `pattern:^` matches at the beginning of the text, and the dollar `pattern:$` -- at the end.
For instance, let's test if the text starts with `Mary`:
```js run
let str1 = "Mary had a little lamb";
alert( /^Mary/.test(str1) ); // true
```
The pattern `pattern:^Mary` means: "string start and then Mary".
Similar to this, we can test if the string ends with `snow` using `pattern:snow$`:
```js run
let str1 = "its fleece was white as snow";
alert( /snow$/.test(str1) ); // true
```
In these particular cases we could use string methods `startsWith/endsWith` instead. Regular expressions should be used for more complex tests.
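A small sketch of the difference. The second test, with alternation and the `pattern:i` flag, is only an assumed example of a "more complex" check:

```js
let str = "Mary had a little lamb";

// Simple prefix/suffix checks: string methods are enough
console.log( str.startsWith("Mary") ); // true
console.log( str.endsWith("lamb") );   // true

// A more complex test: starts with "Mary" or "Tom", in any letter case
let regexp = /^(mary|tom)/i;

console.log( regexp.test(str) );                  // true
console.log( regexp.test("TOM had a lamb too") ); // true
console.log( regexp.test("A lamb had Mary") );    // false
```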
## Testing for a full match
Both anchors together `pattern:^...$` are often used to test whether or not a string fully matches the pattern. For instance, to check if the user input is in the right format.
Let's check whether or not a string is a time in `12:34` format. That is: two digits, then a colon, and then another two digits.
In regular expressions language that's `pattern:\d\d:\d\d`:
```js run
let goodInput = "12:34";
let badInput = "12:345";
let regexp = /^\d\d:\d\d$/;
alert( regexp.test(goodInput) ); // true
alert( regexp.test(badInput) ); // false
```
Here the match for `pattern:\d\d:\d\d` must start exactly after the beginning of the text `pattern:^`, and the end `pattern:$` must immediately follow.
The whole string must be exactly in this format. If there's any deviation or an extra character, the result is `false`.
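To see what the anchors add, here's a minimal sketch that wraps the check into a function (the name `isTimeFormat` is hypothetical) and compares it with the same pattern without anchors:

```js
// Hypothetical helper: true only if the whole string is "two digits, colon, two digits"
function isTimeFormat(str) {
  return /^\d\d:\d\d$/.test(str);
}

console.log( isTimeFormat("12:34") );  // true
console.log( isTimeFormat("12:345") ); // false (extra digit at the end)
console.log( isTimeFormat(" 12:34") ); // false (extra space at the start)

// Without anchors the pattern is found somewhere inside the string
console.log( /\d\d:\d\d/.test("12:345") ); // true
```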
Anchors behave differently if flag `pattern:m` is present. We'll see that in the next article.
```smart header="Anchors have \"zero width\""
Anchors `pattern:^` and `pattern:$` are tests. They have zero width.
In other words, they do not match a character, but rather force the regexp engine to check the condition (text start/end).
```
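A quick way to observe the zero width, as a sketch: the match of an anchor alone is an empty string, and replacing it inserts text without removing anything.

```js
let str = "Mary";

let result = str.match(/^/);
console.log( result[0].length ); // 0, the match is an empty string
console.log( result.index );     // 0, found at the text start

// Replacing an anchor only inserts: no character of the string is consumed
console.log( str.replace(/^/, ">> ") ); // >> Mary
console.log( str.replace(/$/, " <<") ); // Mary <<
```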

View file

@ -0,0 +1,87 @@
# Multiline mode of anchors ^ $, flag "m"
The multiline mode is enabled by the flag `pattern:m`.
It only affects the behavior of `pattern:^` and `pattern:$`.
In the multiline mode they match not only at the beginning and the end of the string, but also at start/end of line.
## Searching at line start ^
In the example below the text has multiple lines. The pattern `pattern:/^\d/gm` takes a digit from the beginning of each line:
```js run
let str = `1st place: Winnie
2nd place: Piglet
3rd place: Eeyore`;
*!*
alert( str.match(/^\d/gm) ); // 1, 2, 3
*/!*
```
Without the flag `pattern:m` only the first digit is matched:
```js run
let str = `1st place: Winnie
2nd place: Piglet
3rd place: Eeyore`;
*!*
alert( str.match(/^\d/g) ); // 1
*/!*
```
That's because by default a caret `pattern:^` only matches at the beginning of the text, and in the multiline mode -- at the start of any line.
```smart
"Start of a line" formally means "immediately after a line break": the test `pattern:^` in multiline mode matches at all positions preceded by a newline character `\n`.
And at the text start.
```
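This can be verified with a small sketch that lists the positions where `pattern:^` matches in multiline mode (it relies on `str.matchAll`, which needs a reasonably modern engine):

```js
let str = `1st place: Winnie
2nd place: Piglet
3rd place: Eeyore`;

// Positions where the test ^ succeeds in multiline mode
let positions = Array.from(str.matchAll(/^/gm), match => match.index);
console.log( positions ); // [ 0, 18, 36 ]

// Every position except the text start comes right after a newline
let afterNewline = positions.slice(1).every(pos => str[pos - 1] === "\n");
console.log( afterNewline ); // true
```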
## Searching at line end $
The dollar sign `pattern:$` behaves similarly.
The regular expression `pattern:\d$` finds the last digit in every line:
```js run
let str = `Winnie: 1
Piglet: 2
Eeyore: 3`;
alert( str.match(/\d$/gm) ); // 1,2,3
```
Without the flag `m`, the dollar `pattern:$` would only match the end of the whole text, so only the very last digit would be found.
```smart
"End of a line" formally means "immediately before a line break": the test `pattern:$` in multiline mode matches at all positions followed by a newline character `\n`.
And at the text end.
```
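The same kind of sketch works for the dollar: here we list the positions where `pattern:$` matches in multiline mode.

```js
let str = `Winnie: 1
Piglet: 2
Eeyore: 3`;

// Positions where the test $ succeeds in multiline mode
let positions = Array.from(str.matchAll(/$/gm), match => match.index);
console.log( positions ); // [ 9, 19, 29 ]

// The first two positions are right before a newline, the last one is the text end
console.log( str[9] === "\n", str[19] === "\n" ); // true true
console.log( str.length ); // 29
```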
## Searching for \n instead of ^ $
To find a newline, we can use not only anchors `pattern:^` and `pattern:$`, but also the newline character `\n`.
What's the difference? Let's see an example.
Here we search for `pattern:\d\n` instead of `pattern:\d$`:
```js run
let str = `Winnie: 1
Piglet: 2
Eeyore: 3`;
alert( str.match(/\d\n/gm) ); // 1\n,2\n
```
As we can see, there are 2 matches instead of 3.
That's because there's no newline after `subject:3` (there's text end though, so it matches `pattern:$`).
Another difference: now every match includes a newline character `match:\n`. Unlike the anchors `pattern:^` `pattern:$`, that only test the condition (start/end of a line), `\n` is a character, so it becomes a part of the result.
So, a `\n` in the pattern is used when we need newline characters in the result, while anchors are used to find something at the beginning/end of a line.

View file

@ -0,0 +1,6 @@
The answer: `pattern:\b\d\d:\d\d\b`.
```js run
alert( "Breakfast at 09:00 in the room 123:456.".match( /\b\d\d:\d\d\b/ ) ); // 09:00
```

View file

@ -0,0 +1,9 @@
# Find the time
The time has a format: `hours:minutes`. Both hours and minutes have two digits, like `09:00`.
Make a regexp to find the time in the string: `subject:Breakfast at 09:00 in the room 123:456.`
P.S. In this task there's no need to check time correctness yet, so `25:99` can also be a valid result.
P.P.S. The regexp shouldn't match `123:456`.

View file

@ -0,0 +1,53 @@
# Word boundary: \b
A word boundary `pattern:\b` is a test, just like `pattern:^` and `pattern:$`.
When the regexp engine (program module that implements searching for regexps) comes across `pattern:\b`, it checks that the position in the string is a word boundary.
There are three different positions that qualify as word boundaries:
- At string start, if the first string character is a word character `pattern:\w`.
- Between two characters in the string, where one is a word character `pattern:\w` and the other is not.
- At string end, if the last string character is a word character `pattern:\w`.
For instance, regexp `pattern:\bJava\b` will be found in `subject:Hello, Java!`, where `subject:Java` is a standalone word, but not in `subject:Hello, JavaScript!`.
```js run
alert( "Hello, Java!".match(/\bJava\b/) ); // Java
alert( "Hello, JavaScript!".match(/\bJava\b/) ); // null
```
In the string `subject:Hello, Java!` the following positions correspond to `pattern:\b`:
![](hello-java-boundaries.svg)
So, it matches the pattern `pattern:\bHello\b`, because:
1. At the beginning of the string the first test `pattern:\b` matches.
2. Then the word `pattern:Hello` matches.
3. Then the test `pattern:\b` matches again, as we're between `subject:o` and a comma.
The pattern `pattern:\bJava\b` would also match. But not `pattern:\bHell\b` (because there's no word boundary after `subject:l`) and not `pattern:Java!\b` (because the exclamation sign is not a wordly character `pattern:\w`, so there's no word boundary after it).
```js run
alert( "Hello, Java!".match(/\bHello\b/) ); // Hello
alert( "Hello, Java!".match(/\bJava\b/) ); // Java
alert( "Hello, Java!".match(/\bHell\b/) ); // null (no match)
alert( "Hello, Java!".match(/\bJava!\b/) ); // null (no match)
```
As `pattern:\b` is a test, it doesn't add the character after the boundary to the result.
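The positions of word boundaries can be listed with a short sketch. It searches for `pattern:\b` alone, so every match is an empty string:

```js
let str = "Hello, Java!";

// Every position where the test \b succeeds
let boundaries = Array.from(str.matchAll(/\b/g), match => match.index);
console.log( boundaries ); // [ 0, 5, 7, 11 ]

// Each match has zero length: \b checks a position and takes no characters
let widths = Array.from(str.matchAll(/\b/g), match => match[0].length);
console.log( widths ); // [ 0, 0, 0, 0 ]
```

There's no boundary at the very end of the string, because the last character `subject:!` is not a word character.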
We can use `pattern:\b` not only with words, but with digits as well.
For example, the pattern `pattern:\b\d\d\b` looks for standalone 2-digit numbers. In other words, it requires that the characters before and after `pattern:\d\d` are different from `pattern:\w` (or that there's the string start/end):
```js run
alert( "1 23 456 78".match(/\b\d\d\b/g) ); // 23,78
```
```warn header="Word boundary `pattern:\b` doesn't work for non-Latin alphabets"
The word boundary test `pattern:\b` checks that there is `pattern:\w` on one side of the position and "not `pattern:\w`" on the other side.
But `pattern:\w` means a Latin letter (or a digit or an underscore), so the test doesn't work for other characters (e.g. Cyrillic letters or hieroglyphs).
```
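Here's a sketch of the limitation, together with one possible workaround. The workaround is an assumption of this example rather than something this article prescribes: it uses lookbehind/lookahead with the Unicode property `pattern:\p{L}` and the flag `pattern:u`, so it needs an engine that supports them.

```js
// \b works for Latin letters...
console.log( "Hello, Java!".match(/\bJava\b/) !== null ); // true

// ...but not for Cyrillic ones: "Я" is not \w, so there's no boundary before it
console.log( "Привет, Ява!".match(/\bЯва\b/) ); // null

// Possible workaround: require "no letter before" and "no letter after"
let regexp = /(?<!\p{L})Ява(?!\p{L})/u;

console.log( regexp.test("Привет, Ява!") );      // true
console.log( regexp.test("Привет, Яванский!") ); // false (a letter follows)
```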

View file

@ -75,7 +75,7 @@ The quotes "consume" backslashes and interpret them, for instance:
- `\n` -- becomes a newline character,
- `\u1234` -- becomes the Unicode character with such code,
- ...And when there's no special meaning: like `\d` or `\z`, then the backslash is simply removed.
- ...And when there's no special meaning: like `pattern:\d` or `\z`, then the backslash is simply removed.
So the call to `new RegExp` gets a string without backslashes. That's why the search doesn't work!
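A minimal sketch of the problem and the fix (the sample text is made up for the illustration):

```js
// In a regular string the quotes "consume" the backslashes
console.log( "\d\.\d" ); // d.d

// So this regexp is actually /d.d/, and it finds nothing here
let wrong = new RegExp("\d\.\d");
console.log( wrong.test("Chapter 5.1") ); // false

// Doubling the backslashes keeps them in the string
let right = new RegExp("\\d\\.\\d");
console.log( right.source );              // \d\.\d
console.log( right.test("Chapter 5.1") ); // true
```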

View file

@ -44,7 +44,7 @@ alert( "Exception 0xAF".match(/x[0-9A-F][0-9A-F]/g) ); // xAF
Please note that in the word `subject:Exception` there's a substring `subject:xce`. It didn't match the pattern, because the letters are lowercase, while in the set `pattern:[0-9A-F]` they are uppercase.
If we want to find it too, then we can add a range `a-f`: `pattern:[0-9A-Fa-f]`. The `i` flag would allow lowercase too.
If we want to find it too, then we can add a range `a-f`: `pattern:[0-9A-Fa-f]`. The `pattern:i` flag would allow lowercase too.
**Character classes are shorthands for certain character sets.**
@ -58,7 +58,7 @@ We can use character classes inside `[…]` as well.
For instance, we want to match all wordly characters or a dash, for words like "twenty-third". We can't do it with `pattern:\w+`, because `pattern:\w` class does not include a dash. But we can use `pattern:[\w-]`.
We also can use several classes, for example `pattern:[\s\S]` matches spaces or non-spaces -- any character. That's wider than a dot `"."`, because the dot matches any character except a newline (unless `s` flag is set).
We also can use several classes, for example `pattern:[\s\S]` matches spaces or non-spaces -- any character. That's wider than a dot `"."`, because the dot matches any character except a newline (unless `pattern:s` flag is set).
## Excluding ranges
@ -69,7 +69,7 @@ They are denoted by a caret character `^` at the start and match any character *
For instance:
- `pattern:[^aeyo]` -- any character except `'a'`, `'e'`, `'y'` or `'o'`.
- `pattern:[^0-9]` -- any character except a digit, the same as `\D`.
- `pattern:[^0-9]` -- any character except a digit, the same as `pattern:\D`.
- `pattern:[^\s]` -- any non-space character, same as `\S`.
The example below looks for any characters except letters, digits and spaces:

View file

@ -1,6 +1,6 @@
We need to look for `#` followed by 6 hexadecimal characters.
A hexadecimal character can be described as `pattern:[0-9a-fA-F]`. Or if we use the `i` flag, then just `pattern:[0-9a-f]`.
A hexadecimal character can be described as `pattern:[0-9a-fA-F]`. Or if we use the `pattern:i` flag, then just `pattern:[0-9a-f]`.
Then we can look for 6 of them using the quantifier `pattern:{6}`.

View file

@ -2,7 +2,7 @@
Let's say we have a string like `+7(903)-123-45-67` and want to find all numbers in it. But unlike before, we are interested not in single digits, but full numbers: `7, 903, 123, 45, 67`.
A number is a sequence of 1 or more digits `\d`. To mark how many we need, we need to append a *quantifier*.
A number is a sequence of 1 or more digits `pattern:\d`. To mark how many we need, we need to append a *quantifier*.
## Quantity {n}

View file

@ -1,21 +0,0 @@
A two-digit hex number is `pattern:[0-9a-f]{2}` (assuming the `pattern:i` flag is enabled).
We need that number `NN`, and then `:NN` repeated 5 times (more numbers);
The regexp is: `pattern:[0-9a-f]{2}(:[0-9a-f]{2}){5}`
Now let's show that the match should capture all the text: start at the beginning and end at the end. That's done by wrapping the pattern in `pattern:^...$`.
Finally:
```js run
let reg = /^[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}$/i;
alert( reg.test('01:32:54:67:89:AB') ); // true
alert( reg.test('0132546789AB') ); // false (no colons)
alert( reg.test('01:32:54:67:89') ); // false (5 numbers, need 6)
alert( reg.test('01:32:54:67:89:ZZ') ) // false (ZZ in the end)
```

View file

@ -1,20 +0,0 @@
# Check MAC-address
[MAC-address](https://en.wikipedia.org/wiki/MAC_address) of a network interface consists of 6 two-digit hex numbers separated by a colon.
For instance: `subject:'01:32:54:67:89:AB'`.
Write a regexp that checks whether a string is MAC-address.
Usage:
```js
let reg = /your regexp/;
alert( reg.test('01:32:54:67:89:AB') ); // true
alert( reg.test('0132546789AB') ); // false (no colons)
alert( reg.test('01:32:54:67:89') ); // false (5 numbers, must be 6)
alert( reg.test('01:32:54:67:89:ZZ') ) // false (ZZ at the end)
```

View file

@ -1,55 +0,0 @@
# String start ^ and finish $
The caret `pattern:'^'` and dollar `pattern:'$'` characters have special meaning in a regexp. They are called "anchors".
The caret `pattern:^` matches at the beginning of the text, and the dollar `pattern:$` -- in the end.
For instance, let's test if the text starts with `Mary`:
```js run
let str1 = "Mary had a little lamb, it's fleece was white as snow";
let str2 = 'Everywhere Mary went, the lamp was sure to go';
alert( /^Mary/.test(str1) ); // true
alert( /^Mary/.test(str2) ); // false
```
The pattern `pattern:^Mary` means: "the string start and then Mary".
Now let's test whether the text ends with an email.
To match an email, we can use a regexp `pattern:[-.\w]+@([\w-]+\.)+[\w-]{2,20}`.
To test whether the string ends with the email, let's add `pattern:$` to the pattern:
```js run
let reg = /[-.\w]+@([\w-]+\.)+[\w-]{2,20}$/g;
let str1 = 'My email is mail@site.com';
let str2 = 'Everywhere Mary went, the lamp was sure to go';
alert( reg.test(str1) ); // true
alert( reg.test(str2) ); // false
```
We can use both anchors together to check whether the string exactly follows the pattern. That's often used for validation.
For instance we want to check that `str` is exactly a color in the form `#` plus 6 hex digits. The pattern for the color is `pattern:#[0-9a-f]{6}`.
To check that the *whole string* exactly matches it, we add `pattern:^...$`:
```js run
let str = "#abcdef";
alert( /^#[0-9a-f]{6}$/i.test(str) ); // true
```
The regexp engine looks for the text start, then the color, and then immediately the text end. Just what we need.
```smart header="Anchors have zero length"
Anchors just like `\b` are tests. They have zero-width.
In other words, they do not match a character, but rather force the regexp engine to check the condition (text start/end).
```
The behavior of anchors changes if there's a flag `pattern:m` (multiline mode). We'll explore it in the next chapter.

View file

@ -1,7 +1,7 @@
Opening tag is `pattern:\[(b|url|quote)\]`.
Then to find everything till the closing tag -- let's use the pattern `pattern:.*?` with flag `s` to match any character including the newline and then add a backreference to the closing tag.
Then to find everything till the closing tag -- let's use the pattern `pattern:.*?` with flag `pattern:s` to match any character including the newline and then add a backreference to the closing tag.
The full pattern: `pattern:\[(b|url|quote)\].*?\[/\1\]`.

View file

@ -1,75 +0,0 @@
# Multiline mode, flag "m"
The multiline mode is enabled by the flag `pattern:/.../m`.
It only affects the behavior of `pattern:^` and `pattern:$`.
In the multiline mode they match not only at the beginning and end of the string, but also at start/end of line.
## Line start ^
In the example below the text has multiple lines. The pattern `pattern:/^\d+/gm` takes a number from the beginning of each one:
```js run
let str = `1st place: Winnie
2nd place: Piglet
33rd place: Eeyore`;
*!*
alert( str.match(/^\d+/gm) ); // 1, 2, 33
*/!*
```
The regexp engine moves along the text and looks for a line start `pattern:^`, when finds -- continues to match the rest of the pattern `pattern:\d+`.
Without the flag `pattern:/.../m` only the first number is matched:
```js run
let str = `1st place: Winnie
2nd place: Piglet
33rd place: Eeyore`;
*!*
alert( str.match(/^\d+/g) ); // 1
*/!*
```
That's because by default a caret `pattern:^` only matches at the beginning of the text, and in the multiline mode -- at the start of any line.
## Line end $
The dollar sign `pattern:$` behaves similarly.
The regular expression `pattern:\w+$` finds the last word in every line
```js run
let str = `1st place: Winnie
2nd place: Piglet
33rd place: Eeyore`;
alert( str.match(/\w+$/gim) ); // Winnie,Piglet,Eeyore
```
Without the `pattern:/.../m` flag the dollar `pattern:$` would only match the end of the whole string, so only the very last word would be found.
## Anchors ^$ versus \n
To find a newline, we can use not only `pattern:^` and `pattern:$`, but also the newline character `\n`.
The first difference is that unlike anchors, the character `\n` "consumes" the newline character and adds it to the result.
For instance, here we use it instead of `pattern:$`:
```js run
let str = `1st place: Winnie
2nd place: Piglet
33rd place: Eeyore`;
alert( str.match(/\w+\n/gim) ); // Winnie\n,Piglet\n
```
Here every match is a word plus a newline character.
And one more difference -- the newline `\n` does not match at the string end. That's why `Eeyore` is not found in the example above.
So, anchors are usually better, they are closer to what we want to get.

View file

@ -101,9 +101,9 @@ Lookaround types:
| Pattern | type | matches |
|--------------------|------------------|---------|
| `pattern:x(?=y)` | Positive lookahead | `x` if followed by `y` |
| `pattern:x(?!y)` | Negative lookahead | `x` if not followed by `y` |
| `pattern:(?<=y)x` | Positive lookbehind | `x` if after `y` |
| `pattern:(?<!y)x` | Negative lookbehind | `x` if not after `y` |
| `pattern:x(?=y)` | Positive lookahead | `x` if followed by `pattern:y` |
| `pattern:x(?!y)` | Negative lookahead | `x` if not followed by `pattern:y` |
| `pattern:(?<=y)x` | Positive lookbehind | `x` if after `pattern:y` |
| `pattern:(?<!y)x` | Negative lookbehind | `x` if not after `pattern:y` |
Lookahead can also be used to disable backtracking. Why that may be needed and other details -- see in the next chapter.
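The four lookaround types from the table can be tried on a pair of sample strings (the strings are assumptions made for this sketch; lookbehind needs an engine that supports it):

```js
let str1 = "2 turkeys cost 60€";

let lookahead = str1.match(/\d+(?=€)/)[0];    // digits followed by €
let negLookahead = str1.match(/\d+(?!€)/)[0]; // digits not followed by €

console.log( lookahead );    // 60
console.log( negLookahead ); // 2

let str2 = "2 turkeys cost $60";

let lookbehind = str2.match(/(?<=\$)\d+/)[0];    // digits right after $
let negLookbehind = str2.match(/(?<!\$)\d+/)[0]; // digits not after $

console.log( lookbehind );    // 60
console.log( negLookbehind ); // 2
```

In all four cases the currency sign is only checked; it doesn't become a part of the match.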

View file

@ -1,7 +1,7 @@
# Sticky flag "y", searching at position
To grasp the use case of `y` flag, and see how great it is, let's explore a practical use case.
To grasp the use case of `pattern:y` flag, and see how great it is, let's explore a practical use case.
One of common tasks for regexps is "parsing": when we get a text and analyze it for logical components, build a structure.
@ -43,7 +43,7 @@ We could work around that by checking if "`regexp.exec(str).index` property is `
So we've come to the problem: how to search for a match exactly at the given position.
That's what `y` flag does. It makes the regexp search only at the `lastIndex` position.
That's what `pattern:y` flag does. It makes the regexp search only at the `lastIndex` position.
Here's an example
@ -66,8 +66,8 @@ alert (regexp.exec(str)); // function (match!)
As we can see, now the regexp is only matched at the given position.
So what `y` does is truly unique, and very important for writing parsers.
So what `pattern:y` does is truly unique, and very important for writing parsers.
The `y` flag allows to test a regular expression exactly at the given position and when we understand what's there, we can move on -- step by step examining the text.
The `pattern:y` flag allows to test a regular expression exactly at the given position and when we understand what's there, we can move on -- step by step examining the text.
Without the flag the regexp engine always searches till the end of the text, that takes time, especially if the text is large. So our parser would be very slow. The `y` flag is exactly the right thing here.
Without the flag the regexp engine always searches till the end of the text, that takes time, especially if the text is large. So our parser would be very slow. The `pattern:y` flag is exactly the right thing here.
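A minimal sketch of that step-by-step examination (the sample string and positions are made up for the illustration):

```js
let str = 'let varName = "value"';

let regexp = /\w+/y;

// There's a space at position 3, so the sticky search fails right away
regexp.lastIndex = 3;
let atSpace = regexp.exec(str);
console.log( atSpace ); // null

// A word starts exactly at position 4
regexp.lastIndex = 4;
let atWord = regexp.exec(str);
console.log( atWord[0] );        // varName
console.log( regexp.lastIndex ); // 11, right after the match
```

A failed sticky search resets `regexp.lastIndex` to `0`, so the position has to be set again before the next attempt.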

View file

@ -16,12 +16,12 @@ So, here are general recipes, the details to follow:
**To search for all matches:**
Use regexp `g` flag and:
Use regexp `pattern:g` flag and:
- Get a flat array of matches -- `str.match(reg)`
- Get an array of matches with details -- `str.matchAll(reg)`.
**To search for the first match only:**
- Get the full first match -- `str.match(reg)` (without `g` flag).
- Get the full first match -- `str.match(reg)` (without `pattern:g` flag).
- Get the string position of the first match -- `str.search(reg)`.
- Check if there's a match -- `regexp.test(str)`.
- Find the match from the given position -- `regexp.exec(str)` (set `regexp.lastIndex` to position).
@ -50,9 +50,9 @@ We can't find next matches using `search`, there's just no syntax for that. But
## str.match(reg), no "g" flag
The behavior of `str.match` varies depending on whether `reg` has `g` flag or not.
The behavior of `str.match` varies depending on whether `reg` has `pattern:g` flag or not.
First, if there's no `g` flag, then `str.match(reg)` looks for the first match only.
First, if there's no `pattern:g` flag, then `str.match(reg)` looks for the first match only.
The result is an array with that match and additional properties:
@ -90,7 +90,7 @@ alert( result.index ); // 0
alert( result.input ); // JavaScript is a programming language
```
Due to the `i` flag the search is case-insensitive, so it finds `match:JavaScript`. The part of the match that corresponds to `pattern:SCRIPT` becomes a separate array item.
Due to the `pattern:i` flag the search is case-insensitive, so it finds `match:JavaScript`. The part of the match that corresponds to `pattern:SCRIPT` becomes a separate array item.
So, this method is used to find one full match with all details.
@ -119,7 +119,7 @@ let result = str.match( *!*/h(o)/ig*/!* );
alert( result ); // HO, Ho, ho
```
**So, with `g` flag `str.match` returns a simple array of all matches, without details.**
**So, with `pattern:g` flag `str.match` returns a simple array of all matches, without details.**
If we want to get information about match positions and contents of parentheses then we should use `matchAll` method that we'll cover below.
@ -230,14 +230,14 @@ There's a pitfall though.
You can see that in the example above: only the first `"-"` is replaced by `":"`.
To find all dashes, we need to use not the string `"-"`, but a regexp `pattern:/-/g`, with an obligatory `g` flag:
To find all dashes, we need to use not the string `"-"`, but a regexp `pattern:/-/g`, with an obligatory `pattern:g` flag:
```js run
// replace all dashes by a colon
alert( '12-34-56'.replace( *!*/-/g*/!*, ":" ) ) // 12:34:56
```
The second argument is a replacement string. We can use special characters in it:
The second argument is a replacement string. We can use special character in it:
| Symbol | Inserts |
|--------|--------|
@ -339,17 +339,17 @@ Using a function gives us the ultimate replacement power, because it gets all th
We've already seen these searching methods:
- `search` -- looks for the position of the match,
- `match` -- if there's no `g` flag, returns the first match with parentheses and all details,
- `match` -- if there's a `g` flag -- returns all matches, without details parentheses,
- `match` -- if there's no `pattern:g` flag, returns the first match with parentheses and all details,
- `match` -- if there's a `pattern:g` flag -- returns all matches, without details parentheses,
- `matchAll` -- returns all matches with details.
The `regexp.exec` method is the most flexible searching method of all. Unlike previous methods, `exec` should be called on a regexp, rather than on a string.
It behaves differently depending on whether the regexp has the `g` flag.
It behaves differently depending on whether the regexp has the `pattern:g` flag.
If there's no `g`, then `regexp.exec(str)` returns the first match, exactly as `str.match(reg)`. Such behavior does not give us anything new.
If there's no `pattern:g`, then `regexp.exec(str)` returns the first match, exactly as `str.match(reg)`. Such behavior does not give us anything new.
But if there's `g`, then:
But if there's `pattern:g`, then:
- `regexp.exec(str)` returns the first match and *remembers* the position after it in `regexp.lastIndex` property.
- The next call starts to search from `regexp.lastIndex` and returns the next match.
- If there are no more matches then `regexp.exec` returns `null` and `regexp.lastIndex` is set to `0`.
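The steps above can be sketched as a loop that calls `regexp.exec` until it returns `null` (the sample string is an assumption of this sketch):

```js
let str = "More about JavaScript at https://javascript.info";

let regexp = /javascript/ig;

let found = [];
let result;

// each call continues the search from regexp.lastIndex
while (result = regexp.exec(str)) {
  found.push(`${result[0]} at ${result.index}`);
}

console.log( found );            // [ 'JavaScript at 11', 'javascript at 33' ]
console.log( regexp.lastIndex ); // 0, reset after the last (failed) search
```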

View file

@ -1,89 +0,0 @@
# Unicode: flag "u"
The unicode flag `/.../u` enables the correct support of surrogate pairs.
Surrogate pairs are explained in the chapter <info:string>.
Let's briefly review them here. In short, normally characters are encoded with 2 bytes. That gives us 65536 characters maximum. But there are more characters in the world.
So certain rare characters are encoded with 4 bytes, like `𝒳` (mathematical X) or `😄` (a smile).
Here are the unicode values to compare:
| Character | Unicode | Bytes |
|------------|---------|--------|
| `a` | 0x0061 | 2 |
| `≈` | 0x2248 | 2 |
|`𝒳`| 0x1d4b3 | 4 |
|`𝒴`| 0x1d4b4 | 4 |
|`😄`| 0x1f604 | 4 |
So characters like `a` and `≈` occupy 2 bytes, and those rare ones take 4.
The unicode is made in such a way that the 4-byte characters only have a meaning as a whole.
In the past JavaScript did not know about that, and many string methods still have problems. For instance, `length` thinks that here are two characters:
```js run
alert('😄'.length); // 2
alert('𝒳'.length); // 2
```
...But we can see that there's only one, right? The point is that `length` treats 4 bytes as two 2-byte characters. That's incorrect, because they must be considered only together (so-called "surrogate pair").
Normally, regular expressions also treat "long characters" as two 2-byte ones.
That leads to odd results, for instance let's try to find `pattern:[𝒳𝒴]` in the string `subject:𝒳`:
```js run
alert( '𝒳'.match(/[𝒳𝒴]/) ); // odd result (wrong match actually, "half-character")
```
The result is wrong, because by default the regexp engine does not understand surrogate pairs.
So, it thinks that `[𝒳𝒴]` are not two, but four characters:
1. the left half of `𝒳` `(1)`,
2. the right half of `𝒳` `(2)`,
3. the left half of `𝒴` `(3)`,
4. the right half of `𝒴` `(4)`.
We can list them like this:
```js run
for(let i=0; i<'𝒳𝒴'.length; i++) {
alert('𝒳𝒴'.charCodeAt(i)); // 55349, 56499, 55349, 56500
};
```
So it finds only the "left half" of `𝒳`.
In other words, the search works like `'12'.match(/[1234]/)`: only `1` is returned.
## The "u" flag
The `/.../u` flag fixes that.
It enables surrogate pairs in the regexp engine, so the result is correct:
```js run
alert( '𝒳'.match(/[𝒳𝒴]/u) ); // 𝒳
```
Let's see one more example.
If we forget the `u` flag and accidentally use surrogate pairs, then we can get an error:
```js run
'𝒳'.match(/[𝒳-𝒴]/); // SyntaxError: invalid range in character class
```
Normally, regexps understand `[a-z]` as a "range of characters with codes between codes of `a` and `z`".
But without `u` flag, surrogate pairs are assumed to be a "pair of independent characters", so `[𝒳-𝒴]` is like `[<55349><56499>-<55349><56500>]` (replaced each surrogate pair with code points). Now we can clearly see that the range `56499-55349` is unacceptable, as the left range border must be less than the right one.
Using the `u` flag makes it work right:
```js run
alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴
```

View file

@ -1,86 +0,0 @@
# Unicode character properties \p
[Unicode](https://en.wikipedia.org/wiki/Unicode), the encoding format used by JavaScript strings, has a lot of properties for different characters (or, technically, code points). They describe which "categories" character belongs to, and a variety of technical details.
In regular expressions these can be set by `\p{…}`. And there must be flag `'u'`.
For instance, `\p{Letter}` denotes a letter in any of language. We can also use `\p{L}`, as `L` is an alias of `Letter`, there are shorter aliases for almost every property.
Here's the main tree of properties:
- Letter `L`:
- lowercase `Ll`, modifier `Lm`, titlecase `Lt`, uppercase `Lu`, other `Lo`
- Number `N`:
- decimal digit `Nd`, letter number `Nl`, other `No`
- Punctuation `P`:
- connector `Pc`, dash `Pd`, initial quote `Pi`, final quote `Pf`, open `Ps`, close `Pe`, other `Po`
- Mark `M` (accents etc):
- spacing combining `Mc`, enclosing `Me`, non-spacing `Mn`
- Symbol `S`:
- currency `Sc`, modifier `Sk`, math `Sm`, other `So`
- Separator `Z`:
- line `Zl`, paragraph `Zp`, space `Zs`
- Other `C`:
- control `Cc`, format `Cf`, not assigned `Cn`, private use `Co`, surrogate `Cs`
```smart header="More information"
Interested to see which characters belong to a property? There's a tool at <http://cldr.unicode.org/unicode-utilities/list-unicodeset> for that.
You could also explore properties at [Character Property Index](http://unicode.org/cldr/utility/properties.jsp).
For the full Unicode Character Database in text format (along with all properties), see <https://www.unicode.org/Public/UCD/latest/ucd/>.
```
There are also other derived categories, like:
- `Alphabetic` (`Alpha`), includes Letters `L`, plus letter numbers `Nl` (e.g. roman numbers Ⅻ), plus some other symbols `Other_Alphabetic` (`OAlpha`).
- `Hex_Digit` includes hexadecimal digits: `0-9`, `a-f`.
- ...Unicode is a big beast, it includes a lot of properties.
For instance, let's look for a 6-digit hex number:
```js run
let reg = /\p{Hex_Digit}{6}/u; // flag 'u' is required
alert("color: #123ABC".match(reg)); // 123ABC
```
There are also properties with a value. For instance, Unicode "Script" (a writing system) can be Cyrillic, Greek, Arabic, Han (Chinese) etc, the [list is long](https://en.wikipedia.org/wiki/Script_(Unicode)).
To search for characters in certain scripts ("alphabets"), we should supply `Script=<value>`, e.g. to search for cyrillic letters: `\p{sc=Cyrillic}`, for Chinese glyphs: `\p{sc=Han}`, etc:
```js run
let regexp = /\p{sc=Han}+/gu; // get chinese words
let str = `Hello Привет 你好 123_456`;
alert( str.match(regexp) ); // 你好
```
## Building multi-language \w
The pattern `pattern:\w` means "wordly characters", but doesn't work for languages that use non-Latin alphabets, such as Cyrillic and others. It's just a shorthand for `[a-zA-Z0-9_]`, so `pattern:\w+` won't find any Chinese words etc.
Let's make a "universal" regexp, that looks for wordly characters in any language. That's easy to do using Unicode properties:
```js
/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u
```
Let's decipher. Just as `pattern:\w` is the same as `pattern:[a-zA-Z0-9_]`, we're making a set of our own, that includes:
- `Alphabetic` for letters,
- `Mark` for accents, as in Unicode accents may be represented by separate code points,
- `Decimal_Number` for numbers,
- `Connector_Punctuation` for the `'_'` character and alike,
- `Join_Control` - two special code points with hex codes `200c` and `200d`, used in ligatures e.g. in arabic.
Or, if we replace long names with aliases (a list of aliases [here](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)):
```js run
let regexp = /([\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]+)/gu;
let str = `Hello Привет 你好 123_456`;
alert( str.match(regexp) ); // Hello,Привет,你好,123_456
```

View file

@ -1,7 +1,3 @@
# Regular expressions
Regular expressions is a powerful way of doing search and replace in strings.
In JavaScript regular expressions are implemented using objects of a built-in `RegExp` class and integrated with strings.
Please note that regular expressions vary between programming languages. In this tutorial we concentrate on JavaScript. Of course there's a lot in common, but they are a somewhat different in Perl, Ruby, PHP etc.