Ilya Kantor 2019-09-05 14:57:06 +03:00
parent fc0b18538d
commit 20547570ff
12 changed files with 376 additions and 186 deletions

@ -22,7 +22,7 @@ So the example below gives no matches:
alert( "Voila".match(/V[oi]la/) ); // null, no matches
```
The pattern assumes:
The pattern searches for:
- `pattern:V`,
- then *one* of the letters `pattern:[oi]`,
@ -42,23 +42,56 @@ In the example below we're searching for `"x"` followed by two digits or letters
alert( "Exception 0xAF".match(/x[0-9A-F][0-9A-F]/g) ); // xAF
```
Please note that in the word `subject:Exception` there's a substring `subject:xce`. It didn't match the pattern, because the letters are lowercase, while in the set `pattern:[0-9A-F]` they are uppercase.
Here `pattern:[0-9A-F]` has two ranges: it searches for a character that is either a digit from `0` to `9` or a letter from `A` to `F`.
If we want to find it too, then we can add a range `a-f`: `pattern:[0-9A-Fa-f]`. The `pattern:i` flag would allow lowercase too.
If we'd like to look for lowercase letters as well, we can add the range `a-f`: `pattern:[0-9A-Fa-f]`. Or add the flag `pattern:i`.
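The two options are close but not identical. Here is a minimal sketch comparing them; the sample string and variable names are made up for illustration, and `console.log` is used instead of `alert` so that it also runs outside the browser:

```js
// Hypothetical sample: the word "Exception" plus two hex numbers,
// the second one written with an uppercase "X"
let str = "Exception 0xAF 0Xaf";

// Option 1: add the lowercase range to the set
let explicit = str.match(/x[0-9A-Fa-f][0-9A-Fa-f]/g);

// Option 2: keep the uppercase range and add the i flag
let withFlag = str.match(/x[0-9A-F][0-9A-F]/gi);

console.log(explicit); // xce, xAF
console.log(withFlag); // xce, xAF, Xaf
```

Both variants now find `subject:xce` inside `subject:Exception`. The flag `pattern:i` applies to the whole pattern, so it also lets the literal `pattern:x` match an uppercase `X`, while the explicit range only widens the set.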
**Character classes are shorthands for certain character sets.**
We can also use character classes inside `[…]`.
For instance, if we'd like to look for a wordly character `pattern:\w` or a hyphen `pattern:-`, then the set is `pattern:[\w-]`.
Combining multiple classes is also possible, e.g. `pattern:[\s\d]` means "a space character or a digit".
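A short sketch of both sets in action (the sample strings are chosen for illustration only; `console.log` replaces `alert` so the snippet also runs outside the browser):

```js
let phrase = "twenty-third of May";

// \w alone stops at the hyphen, so the word is split in two
let plainWords = phrase.match(/\w+/g);

// [\w-] accepts wordly characters and the hyphen
let hyphenWords = phrase.match(/[\w-]+/g);

// [\s\d] accepts any space character or any digit
let spacesOrDigits = "a 1\tb2".match(/[\s\d]/g);

console.log(plainWords);            // twenty, third, of, May
console.log(hyphenWords);           // twenty-third, of, May
console.log(spacesOrDigits.length); // 4 (space, "1", tab, "2")
```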
```smart header="Character classes are shorthands for certain character sets"
For instance:
- **\d** -- is the same as `pattern:[0-9]`,
- **\w** -- is the same as `pattern:[a-zA-Z0-9_]`,
- **\s** -- is the same as `pattern:[\t\n\v\f\r ]` plus a few other Unicode space characters.
- **\s** -- is the same as `pattern:[\t\n\v\f\r ]`, plus a few other rare Unicode space characters.
```
We can use character classes inside `[…]` as well.
### Example: multi-language \w
For instance, we want to match all wordly characters or a dash, for words like "twenty-third". We can't do it with `pattern:\w+`, because `pattern:\w` class does not include a dash. But we can use `pattern:[\w-]`.
As the character class `pattern:\w` is a shorthand for `pattern:[a-zA-Z0-9_]`, it can't find Chinese hieroglyphs, Cyrillic letters, etc.
We also can use several classes, for example `pattern:[\s\S]` matches spaces or non-spaces -- any character. That's wider than a dot `"."`, because the dot matches any character except a newline (unless `pattern:s` flag is set).
We can write a more universal pattern, that looks for wordly characters in any language. That's easy with unicode properties: `pattern:[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]`.
Let's decipher it. Similar to `pattern:\w`, we're making a set of our own that includes characters with the following Unicode properties:
- `Alphabetic` (`Alpha`) - for letters,
- `Mark` (`M`) - for accents,
- `Decimal_Number` (`Nd`) - for digits,
- `Connector_Punctuation` (`Pc`) - for the underscore `'_'` and similar characters,
- `Join_Control` (`Join_C`) - two special codes `200c` and `200d`, used in ligatures, e.g. in Arabic.
An example of use:
```js run
let regexp = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]/gu;
let str = `Hi 你好 12`;
// finds all letters and digits:
alert( str.match(regexp) ); // H,i,你,好,1,2
```
Of course, we can edit this pattern: add Unicode properties or remove them. Unicode properties are covered in more detail in the article <info:regexp-unicode>.
```warn header="Unicode properties aren't supported in Edge and Firefox"
Unicode properties `pattern:\p{…}` are not yet implemented in Edge and Firefox. If we really need them, we can use the library [XRegExp](http://xregexp.com/).
Or just use ranges of characters in a language that interests us, e.g. `pattern:[а-я]` for Cyrillic letters.
```
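A minimal sketch of the range-based fallback, with a made-up sample string. One caveat worth checking for any such range: the letter `ё` has a code outside `а-я`, so it has to be listed separately.

```js
// Hypothetical sample text containing "ё" in both cases
let str = "Ёлка и мёд";

// The plain range misses "Ё" and "ё"
let rangeOnly = str.match(/[а-я]+/gi);

// Adding "ё" to the set fixes that; the i flag covers uppercase letters
let rangeWithYo = str.match(/[а-яё]+/gi);

console.log(rangeOnly);   // лка, и, м, д
console.log(rangeWithYo); // Ёлка, и, мёд
```

This works in engines without Unicode property support, at the cost of being tied to a single alphabet.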
## Excluding ranges
@ -78,22 +111,20 @@ The example below looks for any characters except letters, digits and spaces:
alert( "alice15@gmail.com".match(/[^\d\sA-Z]/gi) ); // @ and .
```
## No escaping in […]
## Escaping in […]
Usually when we want to find exactly the dot character, we need to escape it like `pattern:\.`. And if we need a backslash, then we use `pattern:\\`.
Usually when we want to find exactly a special character, we need to escape it like `pattern:\.`. And if we need a backslash, then we use `pattern:\\`, and so on.
In square brackets the vast majority of special characters can be used without escaping:
In square brackets we can use the vast majority of special characters without escaping:
- A dot `pattern:'.'`.
- A plus `pattern:'+'`.
- Parentheses `pattern:'( )'`.
- Dash `pattern:'-'` in the beginning or the end (where it does not define a range).
- A caret `pattern:'^'` if not in the beginning (where it means exclusion).
- And the opening square bracket `pattern:'['`.
- Symbols `pattern:. + ( )` never need escaping.
- A hyphen `pattern:-` is not escaped in the beginning or the end (where it does not define a range).
- A caret `pattern:^` is only escaped in the beginning (where it means exclusion).
- The closing square bracket `pattern:]` is always escaped (if we need to look for that symbol).
In other words, all special characters are allowed except where they mean something for square brackets.
In other words, all special characters are allowed without escaping, except when they mean something for square brackets.
A dot `"."` inside square brackets means just a dot. The pattern `pattern:[.,]` would look for one of the characters: either a dot or a comma.
A dot `.` inside square brackets means just a dot. The pattern `pattern:[.,]` would look for one of the characters: either a dot or a comma.
In the example below the regexp `pattern:[-().^+]` looks for one of the characters `-().^+`:
@ -112,3 +143,55 @@ let reg = /[\-\(\)\.\^\+]/g;
alert( "1 + 2 - 3".match(reg) ); // also works: +, -
```
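The cases where position or escaping does matter (the closing bracket, the caret and the hyphen) can be sketched like this. The sample strings are chosen for illustration, and `console.log` stands in for `alert`:

```js
// The closing bracket must be escaped to become a member of the set;
// the opening bracket needs no escaping here
let brackets = "a[1]".match(/[[\]]/g);

// A caret that is not first and a hyphen at the end are plain characters
let caretAndHyphen = "2^3-1".match(/[+^-]/g);

// To put a literal caret first, escape it; otherwise it would mean exclusion
let caretOrDigit = "2^3-1".match(/[\^\d]/g);

console.log(brackets);       // [, ]
console.log(caretAndHyphen); // ^, -
console.log(caretOrDigit);   // 2, ^, 3, 1
```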
## Ranges and flag "u"
If there are surrogate pairs in the set, flag `pattern:u` is required for them to work correctly.
For instance, let's look for `pattern:[𝒳𝒴]` in the string `subject:𝒳`:
```js run
alert( '𝒳'.match(/[𝒳𝒴]/) ); // shows a strange character, like [?]
// (the search was performed incorrectly, half-character returned)
```
The result is incorrect, because by default regular expressions "don't know" about surrogate pairs.
The regular expression engine treats `[𝒳𝒴]` as four characters rather than two:
1. left half of `𝒳` `(1)`,
2. right half of `𝒳` `(2)`,
3. left half of `𝒴` `(3)`,
4. right half of `𝒴` `(4)`.
We can see their codes like this:
```js run
for (let i = 0; i < '𝒳𝒴'.length; i++) {
  alert('𝒳𝒴'.charCodeAt(i)); // 55349, 56499, 55349, 56500
}
```
So, the example above finds and shows the left half of `𝒳`.
If we add flag `pattern:u`, then the behavior will be correct:
```js run
alert( '𝒳'.match(/[𝒳𝒴]/u) ); // 𝒳
```
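To check what each search actually returns, we can inspect the results instead of displaying them. This is only a sketch: the variable names are made up, and the characters are built with `String.fromCodePoint` so that the snippet doesn't depend on the file encoding.

```js
let X = String.fromCodePoint(0x1D4B3); // 𝒳
let Y = String.fromCodePoint(0x1D4B4); // 𝒴
let setSource = `[${X}${Y}]`;          // the same set as [𝒳𝒴]

// Without the u flag the set holds four 16-bit halves
let withoutU = X.match(new RegExp(setSource))[0];

// With the u flag the set holds two whole characters
let withU = X.match(new RegExp(setSource, "u"))[0];

console.log(withoutU.length);        // 1 (only half of the pair)
console.log(withoutU.charCodeAt(0)); // 55349, the left half of 𝒳
console.log(withU === X);            // true (the whole character)
```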
A similar situation occurs when looking for a range, such as `[𝒳-𝒴]`.
If we forget to add flag `pattern:u`, there will be an error:
```js run
'𝒳'.match(/[𝒳-𝒴]/); // Error: Invalid regular expression
```
The reason is that without flag `pattern:u` surrogate pairs are perceived as two characters, so `[𝒳-𝒴]` is interpreted as `[<55349><56499>-<55349><56500>]` (every surrogate pair is replaced with its codes). Now it's easy to see that the range `56499-55349` is invalid: its starting code `56499` is greater than the end `55349`. That's the formal reason for the error.
With the flag `pattern:u` the pattern works correctly:
```js run
// look for characters from 𝒳 to 𝒵
alert( '𝒴'.match(/[𝒳-𝒵]/u) ); // 𝒴
```
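The difference between the two cases can also be checked in code. In the sketch below, `compiles` is a hypothetical helper (not a built-in): it creates the pattern with the `RegExp` constructor, so an invalid range raises an error at run time that we can catch, rather than breaking the whole script.

```js
// Hypothetical helper: returns true if the pattern source is valid with the given flags
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false; // invalid pattern
    throw err; // anything else is unexpected
  }
}

let X = String.fromCodePoint(0x1D4B3); // 𝒳
let Z = String.fromCodePoint(0x1D4B5); // 𝒵
let rangeSource = `[${X}-${Z}]`;       // the same range as [𝒳-𝒵]

let validWithoutU = compiles(rangeSource, "");
let validWithU = compiles(rangeSource, "u");

console.log(validWithoutU); // false (the range 56499-55349 is out of order)
console.log(validWithU);    // true
```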